CHAPTER 1


Introduction 


In this book, I assume that you are familiar with Python programming. 

In this introductory chapter, I explain why a data scientist should choose 
Python as a programming language. Then I highlight some situations 
where Python is not a good choice. Finally, I describe some good practices
in application development and give some coding examples that a data
scientist needs in their day-to-day job.


Why Python? 

So, why should you choose Python? 

• It has versatile libraries. Python has a ready-made
library for almost any kind of application: statistical
programming, deep learning, network applications,
web crawling, and embedded systems. If you learn
this language, you do not have to stick to a specific
use case. R has a rich set of analytics libraries, but if
you are working on an Internet of Things (IoT)
application and need to code in a device-side
embedded system, it will be difficult in R.


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python,

• It is very high performance. Java is also a versatile
language and has lots of libraries, but Java code runs
on a Java Virtual Machine, which adds an extra layer
of latency. Python uses high-performance libraries
built in other languages. For example, SciPy uses
LAPACK, which is a Fortran library for linear algebra
applications. TensorFlow uses CUDA, which is NVIDIA's
C/C++-based platform for parallel GPU processing.

• It is simple and gives you a lot of freedom to code.
Python syntax reads much like a natural language. It is
easy to remember, and it places no constraints on
variables (such as constants or public/private modifiers).


When to Avoid Using Python 

Python has some downsides too. 

• When you are writing very specific code, Python may 
not always be the best choice. For example, if you are 
writing code that deals only with statistics, R is a better 
choice. If you are writing MapReduce code only, Java is 
a better choice than Python. 

• Python gives you a lot of freedom in coding. So, when 
many developers are working on a large application, 
Java/C++ is a better choice so that one developer/ 
architect can put constraints on another developer’s 
code using public/private and constant keywords. 

• For extremely high-performance applications, there is 
no alternative to C/C++.


OOP in Python

Before proceeding, I will explain some features of object-oriented 
programming (OOP) in a Python context. 

The most basic element of any modern application is an object. To
a programmer or architect, the world is a collection of objects. Objects
consist of two types of members: attributes and methods. Members can be
private, public, or protected. Classes are data types of objects. Every object
is an instance of a class. A class can be inherited in child classes. Two
classes can be associated using composition.

Python has no keywords for public, private, or protected, so
encapsulation (hiding a member from the outside world) is not enforced
by the language; it relies on naming conventions such as a leading
underscore. Like C++, it supports multilevel and multiple inheritance.
Unlike Java, it has no abstract keyword; abstract classes and methods are
declared through the abc module (ABCMeta and abstractmethod).
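
The ideas in this section can be illustrated with a minimal sketch. It is written for Python 3 (the listings in this chapter use Python 2 syntax), and the class names Collector and EchoCollector are hypothetical, chosen only for illustration. The abc module is the standard-library way to declare abstract classes.

```python
from abc import ABC, abstractmethod


class Collector(ABC):
    """Hypothetical abstract base class; the names are illustrative only."""

    def __init__(self, limit):
        # A leading underscore marks the attribute as private by convention;
        # Python does not enforce it.
        self._limit = limit

    @abstractmethod
    def collect(self):
        """Child classes must override this method."""

    def describe(self):
        # Logic shared by every child lives in the base class.
        return "%s collects up to %d items" % (type(self).__name__, self._limit)


class EchoCollector(Collector):
    def collect(self):
        return ["item-%d" % i for i in range(self._limit)]


collector = EchoCollector(limit=2)
print(collector.describe())
print(collector.collect())

try:
    Collector(limit=1)  # an abstract class cannot be instantiated
except TypeError as error:
    print("TypeError:", error)
```

Instantiating the base class fails because collect is still abstract, which is how Python signals that a child class must supply the implementation.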

The following code is an example of a generic web crawler that is
implemented as an airline's web crawler on the Skytrax site and as a retail
crawler for the Mouthshut.com site. I'll return to the topic of web crawling
in Chapter 2.

from abc import ABCMeta, abstractmethod
import BeautifulSoup
import urllib
import sys
import bleach

#################### Root Class (Abstract) ####################
class SkyThoughtCollector(object):
    __metaclass__ = ABCMeta

    # Keys expected in the configuration file
    baseURLString = "base_url"
    airlinesString = "air_lines"
    limitString = "limits"

    # Defaults, overwritten by getConfig
    baseURl = ""
    airlines = []
    limit = 10

    @abstractmethod
    def collectThoughts(self):
        print "Something Wrong!! You're calling an abstract method"

    @classmethod
    def getConfig(cls, configpath):
        #print "In get Config"
        config = {}
        conf = open(configpath)
        for line in conf:
            # Skip comment lines and lines without a key/value pair
            if "#" not in line and "=" in line:
                words = line.strip().split('=')
                config[words[0].strip()] = words[1].strip()
        conf.close()
        #print config
        cls.baseURl = config[cls.baseURLString]
        if config.has_key(cls.airlinesString):
            cls.airlines = config[cls.airlinesString].split(',')
        if config.has_key(cls.limitString):
            cls.limit = int(config[cls.limitString])
        #print cls.airlines

    def downloadURL(self, url):
        #print "downloading url"
        pageFile = urllib.urlopen(url)
        if pageFile.getcode() != 200:
            return "Problem in URL"
        pageHtml = pageFile.read()
        pageFile.close()
        return "".join(pageHtml)

    def remove_junk(self, arg):
        # junk.txt lists one unwanted string per line
        f = open('junk.txt')
        for line in f:
            # str.replace returns a new string, so keep the result
            arg = arg.replace(line.strip(), '')
        f.close()
        return arg

    def print_args(self, args):
        out = ''
        # Python 2 only: make implicit string conversions use UTF-8
        reload(sys)
        sys.setdefaultencoding("utf-8")
        for i, arg in enumerate(args):
            arg = arg.decode('utf8', 'ignore').encode('ascii', 'ignore').strip()
            arg = arg.replace('\n', ' ')
            arg = arg.replace('\r', '')
            arg = self.remove_junk(arg)
            # Tab-separate the fields; no trailing tab after the last one
            if i < len(args) - 1:
                out = out + arg + '\t'
            else:
                out = out + arg
        print out

####################### Airlines Child #######################

class AirLineReviewCollector(SkyThoughtCollector):

    months = ['January', 'February', 'March', 'April', 'May',
              'June', 'July', 'August', 'September', 'October',
              'November', 'December']

    def __init__(self, configpath):
        #print "In Config"
        super(AirLineReviewCollector, self).getConfig(configpath)

    def parseSoupHeader(self, header):
        #print "parsing header"
        name = surname = year = month = date = country = ''
        txt = header.find("h9")
        words = str(txt).strip().split(' ')
        for j in range(len(words) - 1):
            # The month name anchors the positions of the other fields
            if words[j] in self.months:
                date = words[j-1]
                month = words[j]
                year = words[j+1]
                name = words[j+3]
                surname = words[j+4]
        # The country appears in parentheses at the end of the header
        if ")" in words[-1]:
            country = words[-1].split(')')[0]
            if "(" in country:
                country = country.split('(')[1]
            else:
                country = words[-2].split('(')[1] + country
        return (name, surname, year, month, date, country)

    def parseSoupTable(self, table):

        #print "parsing table"
        # Each score is encoded in the file name of a rating image
        images = table.findAll("img")
        over_all = str(images[0]).split("grn_bar_")[1].split(".gif")[0]
        money_value = str(images[1]).split("SCORE_")[1].split(".gif")[0]
        seat_comfort = str(images[2]).split("SCORE_")[1].split(".gif")[0]
        staff_service = str(images[3]).split("SCORE_")[1].split(".gif")[0]
        catering = str(images[4]).split("SCORE_")[1].split(".gif")[0]
        entertainment = str(images[5]).split("SCORE_")[1].split(".gif")[0]
        if 'YES' in str(images[6]):
            recommend = 'YES'
        else:
            recommend = 'NO'
        status = table.findAll("p", {"class": "text25"})
        stat = str(status[2]).split(">")[1].split("<")[0]
        return (stat, over_all, money_value, seat_comfort,
                staff_service, catering, entertainment, recommend)

    def collectThoughts(self):
        #print "Collecting Thoughts"
        for al in AirLineReviewCollector.airlines:
            count = 0
            while count < AirLineReviewCollector.limit:
                count = count + 1
                url = ''
                # The first page has no numeric suffix
                if count == 1:
                    url = AirLineReviewCollector.baseURl + al + ".htm"
                else:
                    url = AirLineReviewCollector.baseURl + al + \
                        "_" + str(count) + ".htm"
                soup = BeautifulSoup.BeautifulSoup(
                    super(AirLineReviewCollector, self).downloadURL(url))
                blogs = soup.findAll("p", {"class": "text2"})
                tables = soup.findAll("table", {"width": "192"})
                review_headers = soup.findAll("td", {"class": "airport"})
                for i in range(len(tables) - 1):
                    (name, surname, year, month, date, country) = \
                        self.parseSoupHeader(review_headers[i])
                    (stat, over_all, money_value, seat_comfort,
                     staff_service, catering, entertainment,
                     recommend) = self.parseSoupTable(tables[i])
                    blog = str(blogs[i]).split(">")[1].split("<")[0]
                    args = [al, name, surname, year, month, date,
                            country, stat, over_all, money_value,
                            seat_comfort, staff_service, catering,
                            entertainment, recommend, blog]
                    super(AirLineReviewCollector, self).print_args(args)

######################## Retail Child ########################

class RetailReviewCollector(SkyThoughtCollector):

    def __init__(self, configpath):
        #print "In Config"
        super(RetailReviewCollector, self).getConfig(configpath)

    def collectThoughts(self):
        soup = BeautifulSoup.BeautifulSoup(
            super(RetailReviewCollector, self).downloadURL(
                RetailReviewCollector.baseURl))
        lines = soup.findAll("a", {"style": "font-size:15px;"})
        links = []
        for line in lines:
            if ("review" in str(line)) and ("target" in str(line)):
                ln = str(line)
                # Keep the href value and drop the surrounding quotes
                link = ln.split("href=")[-1].split("target=")[0] \
                    .replace("\"", "").strip()
                links.append(link)

        for link in links:
            soup = BeautifulSoup.BeautifulSoup(
                super(RetailReviewCollector, self).downloadURL(link))
            comment = bleach.clean(
                str(soup.findAll("div", {"itemprop": "description"})[0]),
                tags=[], strip=True)
            tables = soup.findAll("table",
                                  {"class": "smallfont space0 pad2"})
            parking = ambience = store_range = economy = product = 0
            for table in tables:
                if "Parking:" in str(table):
                    rows = table.findAll("tbody")[0].findAll("tr")
                    for row in rows:
                        # Each filled bar image counts as one rating point
                        if "Parking:" in str(row):
                            parking = str(row).count("read-barfull")
                        if "Ambience" in str(row):
                            ambience = str(row).count("read-barfull")
                        if "Store" in str(row):
                            store_range = str(row).count("read-barfull")
                        if "Value" in str(row):
                            economy = str(row).count("read-barfull")
                        if "Product" in str(row):
                            product = str(row).count("smallratefull")
            author = bleach.clean(
                str(soup.findAll("span", {"itemprop": "author"})[0]),
                tags=[], strip=True)
            date = soup.findAll(
                "meta", {"itemprop": "datePublished"})[0]["content"]
            args = [date, author, str(parking), str(ambience),
                    str(store_range), str(economy), str(product), comment]
            super(RetailReviewCollector, self).print_args(args)

######################## Main Function ########################
if __name__ == "__main__":
    if sys.argv[1] == 'airline':
        instance = AirLineReviewCollector(sys.argv[2])
        instance.collectThoughts()
    else:
        if sys.argv[1] == 'retail':
            instance = RetailReviewCollector(sys.argv[2])
            instance.collectThoughts()
        else:
            print "Usage is"
            print sys.argv[0], '<airline/retail>', "<Config File Path>"
The configuration for the previous code is shown here: 

base_url = http://www.airlinequality.com/Forum/
#base_url = http://www.mouthshut.com/product-reviews/Mega-Mart-Bangalore-reviews-925103466
#base_url = http://www.mouthshut.com/product-reviews/Megamart-Chennai-reviews-925104102
air_lines = emrts,brit_awys,ual,biman,flydubai
limits = 10

I'll now discuss the previous code in brief. It has a root class that is an
abstract class. It contains attributes such as a base URL and a page limit,
which are essential for all child classes. It also contains the common
logic, such as downloading a URL, printing the output, and reading the
configuration. It also has an abstract method, collectThoughts, which
must be implemented in child classes. This abstract method passes on a
common behavior to every child class: all of them must collect thoughts
from the Web. Implementations of this thought collection are child
specific.
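
The configuration-reading step can be sketched on its own as follows. This is a minimal Python 3 illustration, not the listing's exact logic: the function name parse_config and the sample values are hypothetical, and the sketch skips only lines that start with #, whereas the listing skips any line containing that character.

```python
def parse_config(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split only on the first '=' so values may contain '=' themselves.
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


# Hypothetical sample in the same shape as the configuration shown earlier
sample = """
base_url = http://example.com/forum/
# a commented-out alternative
air_lines = first,second
limits = 10
"""

config = parse_config(sample)
airlines = config.get("air_lines", "").split(",")
limit = int(config.get("limits", 10))
print(config["base_url"], airlines, limit)
```

Splitting on the first = only keeps URLs with query strings intact, which matters for a crawler configuration.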


Calling Other Languages in Python 

Now I will describe how to use other languages' code in Python. There are
two examples here; the first is calling R code from Python. R code is
required for some use cases. For example, if you want a ready-made
function for the Holt-Winters method in a time series, it is difficult to do
in Python. But it is available in R. So, you can call R code from Python
using the rpy2 module, as shown here:

import rpy2.robjects as ro 

ro.r('data(input)') 

ro.r('x <-HoltWinters(input)') 

Sometimes you need to call Java code from Python. For example,
say you are working on a named entity recognition problem in the field of
natural language processing (NLP); some text is given as input, and you
have to recognize the names in the text. Python's NLTK package does have
a named entity recognition function, but its accuracy is not good. Stanford
NLP, which is written in Java, is a better choice here. You can solve this
problem in two ways.

• You can call Java at the command line using
Python code.

import subprocess
subprocess.call(['java', '-cp', '*',
                 'edu.stanford.nlp.sentiment.SentimentPipeline',
                 '-file', 'foo.txt'])

• You can expose Stanford NLP as a web service and call
it as a service.

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://127.0.0.1:9000')
output = nlp.annotate(sentence, properties={
    "annotators": "tokenize,ssplit,parse,sentiment",
    "outputFormat": "json",
    # Only split the sentence at End Of Line.
    # We assume that this method only takes in one single sentence.
    "ssplit.eolonly": "true",
    # Setting enforceRequirements to skip some
    # annotators and make the process faster
    "enforceRequirements": "false"
})
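
When you call an external program from the command line, as in the first option above, you usually want its output and a clear failure when it exits with an error. The following is a minimal Python 3 sketch of that pattern. The helper name run_command is hypothetical, and the Python interpreter stands in for the java executable so that the sketch runs without Stanford NLP installed.

```python
import subprocess
import sys


def run_command(args):
    """Run an external program and return its standard output as text."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a nonzero exit status
    )
    return completed.stdout.strip()


# Assumption: the interpreter stands in for `java`; with Stanford NLP you
# would pass the java command line from the text instead.
output = run_command([sys.executable, "-c", "print('processed foo.txt')"])
print(output)
```

Passing the arguments as a list avoids shell quoting problems, and check=True turns a failed run into an exception instead of a silently ignored return code.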

Exposing the Python Model
as a Microservice


You can expose your Python model as a microservice so that others can
use it from their own code. The best way to do this is to expose your
model as a web service. As an example, the following code exposes a deep
learning model using Flask:

from flask import Flask, request, g
from flask_cors import CORS
import tensorflow as tf
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
import pygeoip
from pymongo import MongoClient
import json
import datetime as dt
import ipaddress
import math

app = Flask(__name__)
CORS(app)

@app.before_request
def before():
    # Score table with one row per (feature_name, feature_value) pair
    db = create_engine('sqlite:///score.db')
    metadata = MetaData(db)
    g.scores = Table('scores', metadata, autoload=True)
    Session = sessionmaker(bind=db)
    g.session = Session()
    # Visit frequency per IP address
    client = MongoClient()
    g.db = client.frequency
    g.gi = pygeoip.GeoIP('GeoIP.dat')
    # Restore the trained model to read the dropped features and the bias
    sess = tf.Session()
    new_saver = tf.train.import_meta_graph('model.obj.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('./'))
    all_vars = tf.get_collection('vars')
    g.dropped_features = str(sess.run(all_vars[0]))
    g.b = sess.run(all_vars[1])[0]
    return

def get_hour(timestamp):
    # The timestamp is in milliseconds
    return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour

def get_value(session, scores, feature_name, feature_value):
    s = scores.select((scores.c.feature_name == feature_name) &
                      (scores.c.feature_value == feature_value))
    rs = s.execute()
    row = rs.fetchone()
    if row is not None:
        return float(row['score'])
    else:
        return 0.0

@app.route('/predict', methods=['POST'])
def predict():
    input_json = request.get_json(force=True)
    features = ['size', 'domain', 'client_time', 'device',
                'ad_position', 'client_size', 'ip', 'root']
    predicted = 0
    feature_value = ''
    for f in features:
        if f not in g.dropped_features:
            if f == 'ip':
                feature_value = str(ipaddress.IPv4Address(
                    ipaddress.ip_address(unicode(request.remote_addr))))
            else:
                feature_value = input_json.get(f)
            if f == 'ip':
                # The IP address contributes two derived features
                if 'geo' not in g.dropped_features:
                    geo = g.gi.country_name_by_addr(feature_value)
                    predicted = predicted + get_value(
                        g.session, g.scores, 'geo', geo)
                if 'frequency' not in g.dropped_features:
                    res = g.db.frequency.find_one({"ip": feature_value})
                    freq = 1
                    if res is not None:
                        freq = res['frequency']
                    predicted = predicted + get_value(
                        g.session, g.scores, 'frequency', str(freq))
            if f == 'client_time':
                feature_value = get_hour(int(feature_value))
            predicted = predicted + get_value(
                g.session, g.scores, f, feature_value)
    return str(math.exp(predicted + g.b) - 1)

app.run(debug=True, host='0.0.0.0')

This code exposes a deep learning model as a Flask web service.

A JavaScript client sends a request with web user parameters such
as the IP address, ad size, ad position, and so on, and the service returns
the price of the ad as a response. The features are categorical. You will
learn how to convert them into numerical scores in Chapter 3. These
scores are stored in an in-memory database. The service fetches the
scores from the database, sums the result, and replies to the client. The
scores are updated in real time in each iteration of training of the deep
learning model. The service uses MongoDB to store how frequently each
IP address visits the site. Frequency is an important parameter because
a user coming to a site for the first time is really searching for something,
which is not true for a user whose frequency is greater than 5. The number
of IP addresses is huge, so they are stored in a distributed MongoDB
database.
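
The scoring step described above can be sketched without Flask, TensorFlow, or MongoDB. This minimal Python 3 illustration uses the standard-library sqlite3 module with an in-memory database; the table contents, the feature values, and the zero bias are all made up for the example.

```python
import math
import sqlite3

# Hypothetical score table; in the service these values come from training.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (feature_name TEXT, feature_value TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("size", "300x250", 0.5), ("device", "mobile", 0.25)],
)


def get_value(conn, feature_name, feature_value):
    """Return the stored score for one categorical value, or 0.0 if unseen."""
    row = conn.execute(
        "SELECT score FROM scores WHERE feature_name = ? AND feature_value = ?",
        (feature_name, str(feature_value)),
    ).fetchone()
    return float(row[0]) if row is not None else 0.0


def predict(conn, request_features, bias=0.0):
    """Sum the per-feature scores and map the total back with exp(x) - 1."""
    total = sum(get_value(conn, name, value)
                for name, value in request_features.items())
    return math.exp(total + bias) - 1


price = predict(conn, {"size": "300x250", "device": "mobile", "domain": "unseen"})
print(round(price, 4))
```

A value that has no stored score contributes 0.0, so an unseen category leaves the prediction unchanged rather than failing the request.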


High-Performance API and Concurrent 
Programming 

Flask is a good choice when you are building a general solution that
also has a graphical user interface (GUI). But if high performance is the
most critical requirement of your application, then Falcon is the best
choice. The following code is an example of the same model shown
previously exposed by the Falcon framework. Another improvement I made
in this code is that I implemented multithreading, so the per-feature
lookups are executed concurrently. Apart from the Falcon-specific changes,
you should note the major change: the calls to the get_value function are
parallelized using a thread pool class.

import falcon
from falcon_cors import CORS
import json
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
import pygeoip
from pymongo import MongoClient
import datetime as dt
import ipaddress
import math
from concurrent.futures import *
from sqlalchemy.engine import Engine
from sqlalchemy import event
import sqlite3

# Enlarge the SQLite page cache on every new connection
@event.listens_for(Engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA cache_size=100000")
    cursor.close()

class Predictor(object):

    def __init__(self, domain):
        db1 = create_engine('sqlite:///score_' + domain + '0test.db')
        db2 = create_engine('sqlite:///probability_' + domain + '0test.db')
        db3 = create_engine('sqlite:///ctr_' + domain + 'test.db')
        metadata1 = MetaData(db1)
        metadata2 = MetaData(db2)
        metadata3 = MetaData(db3)
        self.scores = Table('scores', metadata1, autoload=True)
        self.probabilities = Table('probabilities', metadata2,
                                   autoload=True)
        self.ctr = Table('ctr', metadata3, autoload=True)
        client = MongoClient(connect=False, maxPoolSize=1)
        self.db = client.frequency
        self.gi = pygeoip.GeoIP('GeoIP.dat')
        self.high = 1.2
        self.low = .8

    def get_hour(self, timestamp):
        # The timestamp is in milliseconds
        return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour

    def get_score(self, featurename, featurevalue):
        prob = 0
        pred = 0
        s = self.scores.select(
            (self.scores.c.feature_name == featurename) &
            (self.scores.c.feature_value == featurevalue))
        rs = s.execute()
        row = rs.fetchone()
        if row is not None:
            pred = pred + float(row['score'])
        s = self.probabilities.select(
            (self.probabilities.c.feature_name == featurename) &
            (self.probabilities.c.feature_value == featurevalue))
        rs = s.execute()
        row = rs.fetchone()
        if row is not None:
            prob = prob + float(row['Probability'])
        return pred, prob

    def get_value(self, f, value):
        if f == 'ip':
            ip = str(ipaddress.IPv4Address(ipaddress.ip_address(value)))
            geo = self.gi.country_name_by_addr(ip)
            pred1, prob1 = self.get_score('geo', geo)
            res = self.db.frequency.find_one({"ip": ip})
            freq = 1
            if res is not None:
                freq = res['frequency']
            pred2, prob2 = self.get_score('frequency', str(freq))
            return (pred1 + pred2), (prob1 + prob2)
        if f == 'root':
            s = self.ctr.select(self.ctr.c.root == value)
            rs = s.execute()
            row = rs.fetchone()
            if row is not None:
                ctr = row['ctr']
                avt = row['avt']
                avv = row['avv']
                (pred1, prob1) = self.get_score('ctr', ctr)
                (pred2, prob2) = self.get_score('avt', avt)
                (pred3, prob3) = self.get_score('avv', avv)
                (pred4, prob4) = self.get_score(f, value)
                return (pred1 + pred2 + pred3 + pred4), \
                       (prob1 + prob2 + prob3 + prob4)
        if f == 'client_time':
            value = str(self.get_hour(int(value)))
        if f == 'domain':
            conn = sqlite3.connect('multiplier.db')
            # Pass the domain as a parameter instead of building SQL text
            cursor = conn.execute(
                "SELECT high,low from multiplier where domain=?", (value,))
            row = cursor.fetchone()
            if row is not None:
                self.high = row[0]
                self.low = row[1]
        return self.get_score(f, value)

    def on_post(self, req, resp):
        input_json = json.loads(req.stream.read(), encoding='utf-8')
        input_json['ip'] = unicode(req.remote_addr)
        pred = 1
        prob = 1
        # Look up every feature concurrently and combine the results
        with ThreadPoolExecutor(max_workers=8) as pool:
            future_array = {
                pool.submit(self.get_value, f, input_json[f]): f
                for f in input_json}
            for future in as_completed(future_array):
                pred1, prob1 = future.result()
                pred = pred + pred1
                prob = prob - prob1
        resp.status = falcon.HTTP_200
        res = math.exp(pred) - 1
        if res < 0:
            res = 0
        # Clamp the probability to the range [0.1, 0.9]
        prob = math.exp(prob)
        if (prob <= .1):
            prob = .1
        if (prob >= .9):
            prob = .9
        multiplier = self.low + (self.high - self.low) * prob
        pred = multiplier * pred
        resp.body = str(pred)

cors = CORS(allow_all_origins=True, allow_all_methods=True,
            allow_all_headers=True)
wsgi_app = api = falcon.API(middleware=[cors.middleware])

# Register one Predictor, and one route, per publisher domain
f = open('publishers1.list')
for domain in f:
    domain = domain.strip()
    p = Predictor(domain)
    url = '/predict/' + domain
    api.add_route(url, p)
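
The thread pool pattern used in on_post can be isolated in a small Python 3 sketch. The lookup function and its score values are hypothetical stand-ins for the database queries; only ThreadPoolExecutor, submit, and as_completed come from the listing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def lookup(feature, value):
    """Stand-in for a slow, I/O-bound score lookup (hypothetical values)."""
    scores = {"size": 0.5, "device": 0.25}
    return feature, scores.get(feature, 0.0)


def score_request(features, max_workers=8):
    """Submit one lookup per feature and sum the results as they finish."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(lookup, name, value): name
                   for name, value in features.items()}
        for future in as_completed(futures):
            _, score = future.result()  # re-raises any worker exception
            total += score
    return total


total = score_request({"size": "300x250", "device": "mobile", "root": "x"})
print(total)
```

Threads help here because each lookup mostly waits on database I/O. For CPU-bound work, CPython's global interpreter lock would limit the gain, and a process pool would be the usual alternative.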
CHAPTER 2


ETL with Python
(Structured Data)

Every data science professional has to extract, transform, and load (ETL)
data from different data sources. In this chapter, I will discuss how to do
ETL with Python for a selection of popular databases. For a relational
database, I'll cover MySQL. As an example of a document database, I will
cover Elasticsearch. For a graph database, I'll cover Neo4j, and for NoSQL,
I'll cover MongoDB. I will also discuss the Pandas framework, which was
inspired by R's data frame concept.


MySQL 

MySQLdb is an API in Python developed on top of the MySQL C interface.

How to Install MySQLdb?

First you need to install the Python MySQLdb module on your machine.
Then run the following script:

#! /usr/bin/python 
import MySQLdb 


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python, 
https://doi.org/10.1007/978-1-4842-3450-1_2

If you get an import error exception, that means the module is not
installed properly.

The following commands install the MySQL Python module:

$ gunzip MySQL-python-1.2.2.tar.gz
$ tar -xvf MySQL-python-1.2.2.tar
$ cd MySQL-python-1.2.2
$ python setup.py build
$ python setup.py install

Database Connection 

Before connecting to a MySQL database, make sure you have the following:

• You need a database called TEST.

• In TEST you need a table STUDENT.

• STUDENT needs three fields: NAME, SUR_NAME, and ROLL_NO.

• There needs to be a user in TEST that has complete
access to the database.

INSERT Operation 

The following code carries out the SQL INSERT statement for the purpose
of creating a record in the STUDENT table:

#! /usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Prepare SQL query to INSERT a record into the database.
sql = """INSERT INTO STUDENT(NAME,
         SUR_NAME, ROLL_NO)
         VALUES ('Sayan', 'Mukhopadhyay', 1)"""

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
db.close()
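
The INSERT statement above embeds its values in the SQL text. When the values come from user input, DB-API drivers let you pass them separately as parameters. The following is a minimal Python 3 sketch that uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. The helper name add_student is hypothetical. Note that sqlite3 uses ? placeholders, whereas MySQLdb uses %s.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the table mirrors STUDENT from the text.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE STUDENT (NAME TEXT, SUR_NAME TEXT, ROLL_NO INTEGER)")


def add_student(db, name, surname, roll_no):
    """Insert one record, passing values as parameters rather than SQL text."""
    with db:  # commits on success, rolls back if the block raises
        db.execute(
            "INSERT INTO STUDENT (NAME, SUR_NAME, ROLL_NO) VALUES (?, ?, ?)",
            (name, surname, roll_no),
        )


add_student(db, "Sayan", "Mukhopadhyay", 1)
# A value that looks like SQL is stored verbatim, not executed.
add_student(db, "Robert'); DROP TABLE STUDENT;--", "Tables", 2)
rows = db.execute("SELECT NAME, ROLL_NO FROM STUDENT ORDER BY ROLL_NO").fetchall()
print(rows)
```

Because the driver handles quoting, the second call cannot alter the statement, which is the main reason to prefer parameters over string concatenation.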

READ Operation 

The following code fetches data from the STUDENT table and prints it:

#! /usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Prepare SQL query to SELECT all records from the table.
sql = "SELECT * FROM STUDENT"

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the rows as a sequence of tuples.
    results = cursor.fetchall()
    for row in results:
        fname = row[0]
        lname = row[1]
        id = row[2]
        # Now print fetched result
        print "name=%s,surname=%s,id=%d" % \
            (fname, lname, id)
except MySQLdb.Error:
    print "Error: unable to fetch data"

# disconnect from server
db.close()

DELETE Operation 

The following code deletes the row with ROLL_NO = 1 from the STUDENT
table:

#! /usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "test", "passwd", "TEST")

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Prepare SQL query to DELETE required records
sql = "DELETE FROM STUDENT WHERE ROLL_NO = 1"

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
db.close()

UPDATE Operation 

The following code changes the SUR_NAME field from Mukhopadhyay to
Mukherjee:

#! /usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Prepare SQL query to UPDATE required records
sql = """UPDATE STUDENT SET SUR_NAME='Mukherjee'
         WHERE SUR_NAME='Mukhopadhyay'"""

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()

# disconnect from server
db.close()

COMMIT Operation 

The commit operation tells the database to finalize the modifications;
after this operation, they cannot be reverted.

ROLL-BACK Operation

If you are not completely convinced about any of the modifications and
you want to reverse them, then you need to call the rollback() method
before committing.
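
The interplay of the two operations can be seen in a short Python 3 sketch. It uses the standard-library sqlite3 module as a stand-in for MySQL, which is an assumption made so that the example runs without a server; the commit() and rollback() calls have the same meaning on a MySQLdb connection.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE STUDENT (NAME TEXT, SUR_NAME TEXT, ROLL_NO INTEGER)")
db.execute("INSERT INTO STUDENT VALUES ('Sayan', 'Mukhopadhyay', 1)")
db.commit()  # the inserted row is now permanent

db.execute("UPDATE STUDENT SET SUR_NAME = 'Mukherjee' WHERE ROLL_NO = 1")
db.rollback()  # discard the uncommitted UPDATE
after_rollback = db.execute("SELECT SUR_NAME FROM STUDENT").fetchone()[0]

db.execute("UPDATE STUDENT SET SUR_NAME = 'Mukherjee' WHERE ROLL_NO = 1")
db.commit()  # this time the change is kept
db.rollback()  # nothing is left to undo
after_commit = db.execute("SELECT SUR_NAME FROM STUDENT").fetchone()[0]

print(after_rollback, after_commit)
```

The first UPDATE disappears because it was rolled back before any commit, while the second survives the later rollback because it had already been committed.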

The following is a complete example of accessing MySQL data through
Python. It interactively collects the description of a data source, either
a MySQL table or a delimited text file, and writes it to a configuration
file.

import MySQLdb
import sys

out = open('Config1.txt', 'w')
print "Enter the Data Source Type:"
print "1. MySql"
print "2. Text"
print "3. Exit"

while(1):
    data1 = sys.stdin.readline().strip()
    if(int(data1) == 1):
        out.write("source begin" + "\n" + "type=mysql\n")
        print "Enter the ip:"
        ip = sys.stdin.readline().strip()
        out.write("host=" + ip + "\n")
        print "Enter the database name:"
        db = sys.stdin.readline().strip()
        out.write("database=" + db + "\n")
        print "Enter the user name:"
        usr = sys.stdin.readline().strip()
        out.write("user=" + usr + "\n")
        print "Enter the password:"
        passwd = sys.stdin.readline().strip()
        out.write("password=" + passwd + "\n")

        connection = MySQLdb.connect(ip, usr, passwd, db)
        cursor = connection.cursor()
        # List the tables and let the user pick one by index
        query = "show tables"
        cursor.execute(query)
        data = cursor.fetchall()
        tables = []
        for row in data:
            for field in row:
                tables.append(field.strip())
        for i in range(len(tables)):
            print i, tables[i]
        tb = tables[int(sys.stdin.readline().strip())]
        out.write("table=" + tb + "\n")
        # List the columns of the chosen table
        query = "describe " + tb
        cursor.execute(query)
        data = cursor.fetchall()
        columns = []
        for row in data:
            columns.append(row[0].strip())
        for i in range(len(columns)):
            print columns[i]
        print "Enter the exact column names (not indexes), separated by commas"
        cols = sys.stdin.readline().strip()
        out.write("columns=" + cols + "\n")
        cursor.close()
        connection.close()
        out.write("source end" + "\n")
        print "Enter the Data Source Type:"
        print "1. MySql"
        print "2. Text"
        print "3. Exit"

    if(int(data1) == 2):
        print "path of text file:"
        path = sys.stdin.readline().strip()
        # Show the first few lines so the user can see the layout
        file = open(path)
        count = 0
        for line in file:
            print line
            count = count + 1
            if count > 3:
                break
        file.close()
        out.write("source begin" + "\n" + "type=text\n")
        out.write("path=" + path + "\n")
        print "enter delimiter:"
        dlm = sys.stdin.readline().strip()
        out.write("dlm=" + dlm + "\n")
        print "enter column indexes separated by comma:"
        cols = sys.stdin.readline().strip()
        out.write("columns=" + cols + "\n")
        out.write("source end" + "\n")
        print "Enter the Data Source Type:"
        print "1. MySql"
        print "2. Text"
        print "3. Exit"
    if(int(data1) == 3):
        out.close()
        sys.exit()
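
The script above writes each data source as a block of key=value lines between "source begin" and "source end". Reading such a file back could look like the following Python 3 sketch. The function name parse_sources and the sample values are hypothetical; only the block markers and key names come from the script.

```python
def parse_sources(text):
    """Split 'source begin' ... 'source end' blocks into dictionaries."""
    sources = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "source begin":
            current = {}
        elif line == "source end":
            if current is not None:
                sources.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, value = line.split("=", 1)
            current[key] = value
    return sources


# Hypothetical file contents in the format the script produces
sample = """source begin
type=mysql
host=127.0.0.1
database=TEST
table=STUDENT
columns=NAME,SUR_NAME
source end
source begin
type=text
path=data.csv
dlm=,
columns=0,2
source end
"""

for source in parse_sources(sample):
    print(source["type"], source["columns"].split(","))
```

Each dictionary then carries everything a later ETL step needs to open the source. One limitation of this sketch is that stripping each line would lose a delimiter made only of whitespace.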


Elasticsearch 


The Elasticsearch (ES) low-level client gives a direct mapping from
Python to ES REST endpoints. One of the big advantages of Elasticsearch
is that it provides a full stack solution for data analysis in one place.
Elasticsearch is the database. It has a configurable front end called
Kibana, a data collection tool called Logstash, and an enterprise security
feature called Shield.

The client has attributes called cat, cluster, indices, ingest,
nodes, snapshot, and tasks that give access to instances of CatClient,
ClusterClient, IndicesClient, IngestClient, NodesClient,
SnapshotClient, and TasksClient, respectively. These attributes are the
only supported way to get access to these classes and their methods.

You can specify your own connection class by providing the
connection_class parameter.


# create connection to local host using the ThriftConnection
es1 = Elasticsearch(connection_class=ThriftConnection)

If you want to turn on sniffing, then you have several options
(described later in the chapter).

# create connection that will automatically inspect the cluster to get
# the list of active nodes. Start with nodes running on 'esnode1' and
# 'esnode2'
es1 = Elasticsearch(
    ['esnode1', 'esnode2'],
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 30 seconds
    sniffer_timeout=30
)

Different hosts can have different parameters; you can use one 
dictionary per node to specify them. 

# connect to localhost directly and another node using SSL on port 443
# and an url_prefix. Note that "port" needs to be an int.
es1 = Elasticsearch([
    {'host': 'localhost'},
    {'host': 'othernode', 'port': 443, 'url_prefix': 'es', 'use_ssl': True},
])

SSL client authentication is also supported (see Urllib3HttpConnection
for a detailed description of the options).

es1 = Elasticsearch(
    ['localhost:443', 'other_host:443'],
    # turn on SSL
    use_ssl=True,
    # make sure we verify SSL certificates (off by default)
    verify_certs=True,
    # provide a path to CA certs on disk
    ca_certs='path to CA_certs',
    # PEM formatted SSL client certificate
    client_cert='path to clientcert.pem',
    # PEM formatted SSL client key
    client_key='path to clientkey.pem'
)

Connection Layer API 

Many classes are responsible for dealing with the Elasticsearch cluster.
The default subclasses being used can be overridden by passing
parameters to the Elasticsearch class. Every argument given to the
client is passed on to Transport, ConnectionPool, and Connection.

As an example, if you want to use your own implementation of the
ConnectionSelector class, you just need to pass in the selector_class
parameter.

The entire API wraps the raw REST API closely, which includes the
distinction between the required and optional arguments to the calls.
This means that the code distinguishes between positional and keyword
arguments; I advise you to use keyword arguments for all calls to be
consistent and safe. An API call is successful (and returns a response)
if Elasticsearch returns a 2XX response. Otherwise, an instance of
TransportError (or a more specific subclass) is raised. You can see
other exceptions and error states in the library's exceptions module.
If you do not want an exception to be raised, you can always pass in an
ignore parameter with either a single status code that should be
ignored or a list of them.



from elasticsearch import Elasticsearch
es = Elasticsearch()

# ignore 400 caused by IndexAlreadyExistsException when creating
# an index
es.indices.create(index='test-index', ignore=400)

# ignore 404 and 400
es.indices.delete(index='test-index', ignore=[400, 404])

Neo4j Python Driver 

The Neo4j Python driver is supported by Neo4j and connects with the
database through the binary protocol. It tries to remain minimalistic but at
the same time be idiomatic to Python.

pip install neo4j-driver 

from neo4j.v1 import GraphDatabase, basic_auth

driver11 = GraphDatabase.driver("bolt://localhost",
                                auth=basic_auth("neo4j", "neo4j"))
session11 = driver11.session()

session11.run("CREATE (a:Person {name:'Sayan', "
              "title:'Mukhopadhyay'})")

result11 = session11.run("MATCH (a:Person) WHERE a.name = "
                         "'Sayan' RETURN a.name AS name, a.title AS title")
for record in result11:
    print("%s %s" % (record["title"], record["name"]))
session11.close()
neo4j-rest-client

The main objective of neo4j-rest-client is to let Python programmers
who already use Neo4j locally through python-embedded also access the
Neo4j REST server. So, the structure of the neo4j-rest-client API is
completely in sync with python-embedded. In addition, a new structure is
brought in so as to arrive at a more Pythonic style and to augment the
API with the new features being introduced by the Neo4j team.

In-Memory Database 

Another important class of database is an in-memory database. It stores
and processes the data in RAM. So, operations on the database are very
fast, but the data is volatile. SQLite is a popular example: it can keep
a whole database in memory, although by default it stores the data in a
file. In Python you can use the sqlalchemy library to operate on
SQLite. In Chapter 1's Flask and Falcon example, I showed you how to
select data from SQLite. Here I will show how to store a Pandas data frame
in SQLite:

from sqlalchemy import create_engine
import sqlite3

conn = sqlite3.connect('multiplier.db')
conn.execute('''CREATE TABLE if not exists multiplier
                (domain CHAR(50),
                 low REAL,
                 high REAL);''')
conn.close()

# prop, domain, i, and df are assumed to be defined earlier;
# the "_" separator is assumed because the original string is not legible
db_name = "sqlite:///" + prop + "_" + domain + str(i) + ".db"
disk_engine = create_engine(db_name)
df.to_sql('scores', disk_engine, if_exists='replace')
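
As a minimal sketch of the in-memory case, the following keeps the whole database in RAM by using SQLite's special ":memory:" name. The data frame contents and column values are made up for illustration; only the scores table name is taken from the example above.

```python
import sqlite3
import pandas as pd

# Hypothetical data; the values are illustrative only.
df = pd.DataFrame({
    "domain": ["a.com", "b.com"],
    "low": [0.1, 0.2],
    "high": [0.9, 0.8],
})

# ":memory:" asks SQLite for a database that lives only in RAM.
conn = sqlite3.connect(":memory:")

# Pandas accepts a plain sqlite3 connection for SQLite databases.
df.to_sql("scores", conn, if_exists="replace", index=False)

# Read the rows back to confirm the round trip.
restored = pd.read_sql_query(
    "SELECT domain, low, high FROM scores ORDER BY domain", conn)
conn.close()
```

Once the connection is closed, the in-memory database is gone, which is the volatility mentioned earlier.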
MongoDB (Python Edition)

MongoDB is an open source document database designed for high
performance, high availability, and automatic scaling. MongoDB makes
sure that object-relational mapping (ORM) is not required to facilitate
development. A document that contains a data structure made up of
field and value pairs is referred to as a record in MongoDB. These records
are akin to JSON objects. The values of fields may include other
documents, arrays, and arrays of documents.

{
    "_id": ObjectId("01"),
    "address": {
        "street": "Siraj Mondal Lane",
        "pincode": "743145",
        "building": "129",
        "coord": [ -24.97, 48.68 ]
    },
    "borough": "Manhattan",
    ...
}

Import Data into the Collection 

mongoimport can be used to load documents into a collection in a
database from the system shell or a command prompt. With the --drop
option, if the collection already exists in the database, the operation
drops the original collection first.

mongoimport --db test --collection restaurants --drop --file ~/downloads/primer-dataset.json

The mongoimport command connects to a MongoDB instance running
on localhost on port number 27017. The --file option specifies the file
to import; here it's ~/downloads/primer-dataset.json.

To import data into a MongoDB instance running on a different host
or port, specify the hostname or port in the mongoimport command by
including the --host or --port option.

There is a similar load command in MySQL. 

Create a Connection Using pymongo 

To create a connection, do the following:

from pymongo import MongoClient

client11 = MongoClient()

If no argument is passed to MongoClient, then it will default to the
MongoDB instance running on the localhost interface on port 27017.

A complete MongoDB URL may be given to define the connection,
including the host and port number. For instance, the following code
makes a connection to a MongoDB instance that runs on
mongodb0.example.net on port 27017:

client11 = MongoClient("mongodb://mongodb0.example.net:27017")

Access Database Objects 

To assign the database named primer to the local variable db11, you can
use either of the following lines:

db11 = client11.primer
db11 = client11['primer']

Collection objects can be accessed directly by using the dictionary
style or the attribute access from a database object, as shown in the
following two examples:

coll11 = db11.dataset
coll11 = db11['dataset']
Insert Data

You can insert a document into a collection that doesn't exist, and the
following operation will create the collection:

result = db.address.insert_one({<<your JSON>>})

Update Data 

Here is how to update data:

result = db.address.update_one(
    {"address.building": "129"},
    {"$set": {"address.street": "MG Road"}}
)

Remove Data 

To remove all documents from a collection, use this:

result = db.restaurants.delete_many({})


Pandas 


The goal of this section is to show some examples to enable you to begin
using Pandas. These illustrations have been taken from real-world
data, along with any bugs and weirdness that are inherent. Pandas is a
framework inspired by the R data frame concept.

To read data from a CSV file, use this: 

import pandas as pd 
broken_df=pd.read_csv('data.csv') 
To look at the first three rows, use this:

broken_df[:3]

To select a column, use this:

broken_df['Column Header']

To plot a column, use this:

broken_df['Column Header'].plot()

To get the maximum value of a column in the data set, use this, where
Births is the column header:

MaxValue = df['Births'].max()

Let's assume there is another column in the data set named Names. The
following command finds the names associated with the maximum value:

MaxName = df['Names'][df['Births'] == df['Births'].max()].values

There are many other methods such as sort_values and groupby in Pandas
that are useful to play with structured data. Also, ready-made adapters
are available for using Pandas with popular databases such as MongoDB,
Google BigQuery, and so on.
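
The commands above can be tried on a small data frame. This is only a sketch: the names, counts, and the Gender column are made up for illustration.

```python
import pandas as pd

# Hypothetical data set; the names and counts are made up for illustration.
df = pd.DataFrame({
    "Names": ["Bob", "Jessica", "Mary", "John", "Mel"],
    "Births": [968, 155, 77, 578, 973],
    "Gender": ["M", "F", "F", "M", "F"],
})

# Maximum of a column and the name(s) associated with it.
max_value = df["Births"].max()
max_names = df["Names"][df["Births"] == max_value].values

# Sort by a column, largest first.
ranked = df.sort_values("Births", ascending=False)

# Average number of births per group.
avg_by_gender = df.groupby("Gender")["Births"].mean()
```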

One complex example with Pandas is shown next. In the X data frame, for
each distinct value of each column, find the average value of floor,
grouping by the root column.

# assumes numpy is imported as np and that df and X are existing data frames
for col in X.columns:
    if col != 'root':
        avgs = df.groupby([col, 'root'],
                          as_index=False)['floor'].aggregate(np.mean)
        for i, row in avgs.iterrows():
            k = row[col]
            v = row['floor']
            r = row['root']
            X.loc[(X[col] == k) & (X['root'] == r), col] = v

ETL with Python (Unstructured Data) 

Dealing with unstructured data is an important task in modern data
analysis. In this section, I will cover how to parse e-mails, and I'll introduce
an advanced research topic called topical crawling.

E-mail Parsing

See Chapter 1 for a complete example of web crawling using Python.

Like BeautifulSoup, Python has a library for e-mail parsing. The
following is the example code to parse e-mail data stored on a mail server.
The inputs in the configuration are the username and number of mails to
parse for the user.

from email.parser import Parser
import os
import sys

# read the configuration: "name=...,value=..." lines describe users,
# every other "key=value" line is a general setting
conf = open(sys.argv[1])
config = {}
users = {}
for line in conf:
    if ("," in line):
        fields = line.split(",")
        key = fields[0].strip().split("=")[1].strip()
        val = fields[1].strip().split("=")[1].strip()
        users[key] = val
    else:
        if ("=" in line):
            words = line.strip().split('=')
            config[words[0].strip()] = words[1].strip()
conf.close()

for usr in users.keys():
    path = config["path"] + "/" + usr + "/" + config["folder"]
    files = os.listdir(path)
    for f in sorted(files):
        # process only the mails that are newer than the stored counter
        if (int(f) > int(users[usr])):
            users[usr] = f
            path1 = path + "/" + f
            data = ""
            with open(path1) as myfile:
                data = myfile.read()
            if data != "":
                parser = Parser()
                email = parser.parsestr(data)
                # one comma-separated output line per mail; commas inside
                # the fields are replaced by spaces
                out = ""
                out = out + str(email.get('From')) + "," \
                    + str(email.get('To')) + "," \
                    + str(email.get('Subject')) + "," \
                    + str(email.get('Date')).replace(",", " ")
                if email.is_multipart():
                    for part in email.get_payload():
                        out = out + "," + str(part.get_payload()) \
                            .replace("\n", " ").replace(",", " ")
                else:
                    out = out + "," + str(email.get_payload()) \
                        .replace("\n", " ").replace(",", " ")
                print(out + "\n")

# write the updated counters back to the configuration file
conf = open(sys.argv[1], 'w')
conf.write("path=" + config["path"] + "\n")
conf.write("folder=" + config["folder"] + "\n")
for usr in users.keys():
    conf.write("name=" + usr + ",value=" + users[usr] + "\n")
conf.close()

Here is a sample config file for the above code:

path=/cygdrive/c/share/enron_mail_20110402/enron_mail_20110402/maildir
folder=Inbox
name=storey-g,value=142
name=ybarbo-p,value=775
name=tycholiz-b,value=602
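
To see what the parser returns without a mail directory, here is a minimal sketch that parses a single message held in a string. The message itself (addresses, subject, and body) is a made-up placeholder.

```python
from email.parser import Parser

# A made-up message; the addresses and text are placeholders.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Meeting\n"
    "Date: Sat, 02 Apr 2011 10:00:00 -0500\n"
    "\n"
    "See you at noon.\n"
)

msg = Parser().parsestr(raw)

sender = msg.get("From")
subject = msg.get("Subject")
# A header lookup returns None when the header is absent.
missing = msg.get("Cc")
# For a message that is not multipart, get_payload() returns the body text.
body = msg.get_payload()
```

Because get returns None for a missing header, the listing wraps each header in str before concatenating the output line.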

Topical Crawling 

Topical crawlers are intelligent crawlers that retrieve information from 
anywhere on the Web. They start with a URL and then find links present in 
the pages under it; then they look at new URLs, bypassing the scalability 
limitations of universal search engines. This is done by distributing 
the crawling process across users, queries, and even client computers. 
Crawlers can use the context available to infinitely loop through the links 
with a goal of systematically locating a highly relevant, focused page. 

Web searching is a complicated task. A large chunk of machine 
learning work is being applied to find the similarity between pages, such as 
the maximum number of URLs fetched or visited. 
Crawling Algorithms

The following diagram describes how the topical crawling algorithm works
with its major components.

[Figure: breadth-first crawler components: web page, links, FIFO queue]

Breadth-First (starting_urls) {
    foreach link (starting_urls) {
        enqueue(frontier, link);
    }
    while (visited < MAX_PAGES) {
        link := dequeue_link(frontier);
        doc := fetch(link);
        enqueue(frontier, extract_links(doc));
        if (#frontier > MAX_BUFFER) {
            dequeue_last_links(frontier);
        }
    }
}

BFS (topic, starting_urls) {
    foreach link (starting_urls) {
        enqueue(frontier, link, 1);
    }
    while (visited < MAX_PAGES) {
        link := dequeue_top_link(frontier);
        doc := fetch(link);
        score := sim(topic, doc);
        enqueue(frontier, extract_links(doc), score);
        if (#frontier > MAX_BUFFER) {
            dequeue_bottom_links(frontier);
        }
    }
}

The starting URL of a topical crawler is known as the seed URL. There 
is another set of URLs known as the target URLs, which are examples of 
desired output. 
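
The score-driven crawl can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production crawler: the Web is replaced by a small in-memory dictionary, the URLs and page texts are made up, and the similarity function (the fraction of topic words found in a page) is a stand-in for a real relevance model.

```python
import heapq

# A tiny in-memory stand-in for the Web: URL -> (page text, outgoing links).
# All URLs and texts are made up for illustration.
PAGES = {
    "seed": ("python jobs portal", ["a", "b"]),
    "a": ("python data science resume", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("data science with python cv", []),
    "d": ("gardening tips", []),
}

def sim(topic, text):
    """Fraction of topic words that occur in the page text (toy similarity)."""
    words = set(text.split())
    return sum(1 for w in topic if w in words) / float(len(topic))

def best_first_crawl(topic, starting_urls, max_pages=10, max_buffer=100):
    frontier = []                     # min-heap of (-score, url)
    seen = set(starting_urls)
    for url in starting_urls:
        heapq.heappush(frontier, (-1.0, url))
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)      # best-scoring link first
        text, links = PAGES[url]              # "fetch" the document
        visited.append(url)
        score = sim(topic, text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # a link inherits the score of the page it was found on
                heapq.heappush(frontier, (-score, link))
        if len(frontier) > max_buffer:
            # keep only the best-scoring links
            frontier = heapq.nsmallest(max_buffer, frontier)
            heapq.heapify(frontier)
    return visited

order = best_first_crawl(["python", "data", "science"], ["seed"])
```

In this toy graph, page c is reached through the relevant page a, so it is visited before the unrelated page b even though b was discovered earlier.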

An interesting application of topical crawling is where an HR organization 
is searching for a candidate from anywhere on the Web possessing a 
particular skill set. One easy alternative solution is to use a search engine 
API. The following code is an example of using the Google Search API, 
BeautifulSoup, and regular expressions that search the e-mail ID and phone 
number of potential candidates with a particular skill set from the Web. 





#!/usr/bin/env python
# coding: utf-8
# Note: this listing is written for Python 2 (urllib2, print statements).

import pprint, json, urllib2
import nltk, sys, urllib
import re
import csv
from bs4 import BeautifulSoup
from googleapiclient.discovery import build


def link_score(link):
    # keep links that look like a CV or resume and not like a job posting
    if ('cv' in link or 'resume' in link) and 'job' not in link:
        return True
    return False


def process_file():
    # read the links saved by main() from the search result file
    try:
        with open('data1.json', 'r') as fl:
            data = json.load(fl)
            all_links = []
            # pprint.pprint(len(data['items']))
            for item in data['items']:
                # print item['formattedUrl']
                all_links.append(item['formattedUrl'])
            return all_links
    except:
        return []


def main(istart, search_query):
    # developerKey and cx are placeholders for your own credentials
    service = build("customsearch", "v1",
                    developerKey="abcd")
    res = service.cse().list(
        q=search_query,
        cx='1234',
        num=10,
        gl='in',  # 'in' for India; comment this out for the whole Web
        start=istart,
    ).execute()
    with open('data1.json', 'w') as fp:
        json.dump(res, fp)
    # pprint.pprint(type(res))
    # pprint.pprint(res)


def get_email_ph(link_text, pdf=None):
    # with pdf=True, link_text is a file name whose text is extracted first;
    # otherwise link_text itself is the text to scan
    if pdf == True:
        from textract import process
        text = process(link_text)
    else:
        text = link_text
    # print text
    email = []
    ph = []
    valid_ph = re.compile(r"[789][0-9]{9}$")
    valid = re.compile(r"[A-Za-z]+[@]{1}[A-Za-z]+\.[a-z]+")
    for token in re.split(r'[,\s]', text):
        # for token in nltk.tokenize(text):
        # print token
        a = valid.match(token)
        b = valid_ph.match(token)
        if a != None:
            print a.group()
            email.append(a.group())
        if b != None:
            print b.group()
            ph.append(b.group())
    return email, ph


def process_pdf_link(link):
    html = urllib2.urlopen(link)
    file = open("document.pdf", 'wb')  # binary mode for PDF content
    file.write(html.read())
    file.close()
    return get_email_ph("document.pdf", pdf=True)


def process_doc_link(link):
    testfile = urllib.URLopener()
    testfile.retrieve(link, "document.doc")
    # NOTE: with pdf=False the file name itself is scanned, not its contents
    return get_email_ph("document.doc", pdf=False)


def process_docx_link(link):
    testfile = urllib.URLopener()
    testfile.retrieve(link, "document.docx")
    # NOTE: with pdf=False the file name itself is scanned, not its contents
    return get_email_ph("document.docx", pdf=False)


def process_links(all_links):
    with open('email_ph.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        for link in all_links:
            if link[:4] != 'http':
                link = "http://" + link
            print link
            try:
                if link[-3:] == 'pdf':
                    email, ph = process_pdf_link(link)
                elif link[-4:] == 'docx':
                    email, ph = process_docx_link(link)
                elif link[-3:] == 'doc':
                    email, ph = process_doc_link(link)
                else:
                    html = urllib2.urlopen(link)
                    email, ph = get_email_ph(
                        BeautifulSoup(html.read()).get_text(), pdf=False)
                spamwriter.writerow([link, ' '.join(email), ' '.join(ph)])
            except:
                # report the failure and move on to the next link
                print "error", link
                print sys.exc_info()


if __name__ == '__main__':
    search_query = ' ASP .NET, C#, WebServices, HTML Chicago USA biodata cv'
    all_links = []
    # all_links.extend(links)
    for i in range(1, 90, 10):
        main(i, search_query)
        all_links.extend(process_file())
    process_links(all_links)


This code is used to find relevant contacts from the Web for a set of 
given job-searching keywords. It uses the Google Search API to fetch 
the links, filters them according to the presence of certain keywords in a 
URL, and then parses the link content and finds the e-mail ID and phone 
number. The content may be PDF or Word or HTML documents. 
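
The extraction step can be tried in isolation. The sketch below reuses the two regular expressions from the listing on a made-up line of text; the sample address and phone number are placeholders.

```python
import re

# The same patterns as in the listing above, written as raw strings.
valid_ph = re.compile(r"[789][0-9]{9}$")
valid_email = re.compile(r"[A-Za-z]+[@]{1}[A-Za-z]+\.[a-z]+")

# Made-up text standing in for the content of a fetched page.
text = "Contact: jdoe@example.com, mobile 9876543210 or office 12345"

emails, phones = [], []
for token in re.split(r"[,\s]", text):
    a = valid_email.match(token)
    b = valid_ph.match(token)
    if a is not None:
        emails.append(a.group())
    if b is not None:
        phones.append(b.group())
```

Note that these patterns are deliberately simple: the e-mail pattern accepts only letters before and after the @ sign, and the phone pattern accepts only ten-digit numbers starting with 7, 8, or 9, so addresses containing digits or dots are missed.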
Supervised Learning Using Python

In this chapter, I will introduce the three most essential components of
machine learning.

• Dimensionality reduction tells how to choose the most
important features from a set of features.

• Classification tells how to categorize data to a set of
target categories with the help of a given training/
example data set.

• Regression tells how to realize a variable as a linear or
nonlinear polynomial of a set of independent variables.

Dimensionality Reduction with Python 

Dimensionality reduction is an important aspect of data analysis. It is
required for both numerical and categorical data. Survey or factor analysis
is one of the most popular applications of dimensionality reduction. As an
example, suppose that an organization wants to find out which factors are
most important in influencing or bringing about changes in its operations.
It takes opinions from different employees in the organization and, based
on this survey data, does a factor analysis to derive a smaller set of factors
in conclusion.


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python, 
https://doi.org/10.1007/978-1-4842-3450-1_3
In investment banking, different indices are calculated as a weighted
average of instruments. Thus, when an index goes high, it is expected
that instruments in the index with a positive weight will also go high and
those with a negative weight will go low. The trader trades accordingly.
Generally, indices consist of a large number of instruments (more than
ten). In high-frequency algorithmic trading, it is tough to send so many
orders in a fraction of a second. Using principal component analysis,
traders realize the index as a smaller set of instruments to commence with
the trading. Singular value decomposition is a popular algorithm that
is used both in principal component analysis and in factor analysis. In
this chapter, I will discuss it in detail. Before that, I will cover the Pearson
correlation, which is simple to use. That's why it is a popular method of
dimensionality reduction. Dimensionality reduction is also required for
categorical data. Suppose a retailer wants to know whether a city is an
important contributor to sales volume; this can be measured by using
mutual information, which will also be covered in this chapter.


Correlation Analysis 


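There are different measures of correlation. I will limit this discussion
to the Pearson correlation only. For two variables, x and y, the Pearson
correlation is as follows:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} $$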



The value of r will vary from -1 to +1. The formula shows that when x
is greater than its average at the same points where y is greater than
its average, the products in the numerator are positive, and therefore
the r value is bigger. In other words, if y increases when x increases,
then r is greater. So, if r is nearer to 1, it means that x and y are
positively correlated. Similarly, if r is nearer to -1, it means that x
and y are negatively correlated. Likewise, if r is nearer to 0, it means
that x and y are not correlated. A simplified formula to calculate r is
shown here:


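$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$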
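
As a quick check that the definition and the simplified form agree, the following sketch computes r both ways on a small made-up sample, using only the standard library.

```python
import math

# Made-up sample data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Definition: products of deviations from the means.
mean_x = sum(x) / n
mean_y = sum(y) / n
num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                * sum((b - mean_y) ** 2 for b in y))
r_definition = num / den

# Simplified form: raw sums only.
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
r_simplified = (n * sxy - sx * sy) / (
    math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
)
```

For this sample both expressions evaluate to about 0.775, a fairly strong positive correlation.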


You can easily use correlation for dimensionality reduction. Let's say
Y is a variable that is a weighted sum of n variables: X1, X2, ... Xn. You
want to reduce this set of X to a smaller set. To do so, you need to calculate
the correlation coefficient for each X pair. Now, if Xi and Xj are highly
correlated, then you will investigate the correlation of Y with Xi and Xj. If
the correlation of Xi is greater than Xj, then you remove Xj from the set, and
vice versa. The following function is an example of dropping features
using correlation:

import numpy as np
from scipy.stats import pearsonr

def drop_features(y_train, X_train, X, index):
    i1 = 0
    processed = 0
    while(1):
        flag = True
        for i in range(X_train.shape[1]):
            if i > processed:
                i1 = i1 + 1
                corr = pearsonr(X_train[:, i], y_train)
                # probable error; n is the number of observations
                PEr = .674 * (1 - corr[0]*corr[0]) / (len(X_train[:, i])**(1/2.0))
                if corr[0] < PEr:
                    # drop the column and remember its name
                    X_train = np.delete(X_train, i, 1)
                    index.append(X.columns[i1-1])
                    processed = i - 1
                    flag = False
                    break
        if flag:
            break
    return X_train, index

The actual use case of this code is shown at the end of the chapter. 

Now, the question is, what threshold value of the correlation
coefficient should indicate that, say, X and Y are correlated? A common
practice is to assume that if r > 0.5, the variables are correlated, and if
r < 0.5, there is no correlation. One big limitation of this approach
is that it does not consider the length of the data. For example, a 0.5
correlation in a set of 20 data points should not have the same weight as a
0.5 correlation in a set of 10,000 data points. To overcome this problem,
the concept of probable error has been introduced, as shown here:

$$ PE_r = 0.674 \times \frac{1 - r^2}{\sqrt{n}} $$

r is the correlation coefficient, and n is the sample size.

Here, r > 6PEr means that X and Y are highly correlated, and if r < PEr,
this means that X and Y are independent. Using this approach, you can see
that even r = 0.1 means a high correlation when the data size is huge.
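
The effect of the sample size can be checked numerically. The helper names below are hypothetical, and the label "inconclusive" for values between PEr and 6PEr is an assumption, since the text does not name that case.

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient r for sample size n."""
    return 0.674 * (1 - r * r) / math.sqrt(n)

def interpret(r, n):
    pe = probable_error(r, n)
    if r > 6 * pe:
        return "highly correlated"
    if r < pe:
        return "independent"
    return "inconclusive"  # label chosen here; the text does not name this case

small_sample = interpret(0.1, 20)       # PEr is about 0.149, larger than r
large_sample = interpret(0.1, 10000)    # 6 * PEr is about 0.040, smaller than r
```

With 20 points, r = 0.1 falls below the probable error, while with 10,000 points the same r exceeds six times the probable error.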

One interesting application of correlation is in product
recommendations on an e-commerce site. Recommendations can identify
similar users if you calculate the correlation of their common ratings for
the same products. Similarly, you can find similar products by calculating
the correlation of their common ratings from the same user. This approach
is known as collaborative filtering.
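
A minimal sketch of the user-similarity step follows. The ratings dictionary, the user and product names, and the rule of returning 0 when two users share fewer than two products are all assumptions made for illustration.

```python
import math

# Made-up ratings: user -> {product: rating}.
ratings = {
    "u1": {"p1": 5, "p2": 3, "p3": 4, "p4": 1},
    "u2": {"p1": 4, "p2": 2, "p3": 3, "p5": 5},
    "u3": {"p1": 1, "p2": 5, "p3": 2, "p4": 4},
}

def pearson(xs, ys):
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs)
                    * sum((b - my) ** 2 for b in ys))
    return num / den if den else 0.0

def user_similarity(u, v):
    # correlate only the products that both users have rated
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    return pearson([ratings[u][p] for p in common],
                   [ratings[v][p] for p in common])

sim_12 = user_similarity("u1", "u2")
sim_13 = user_similarity("u1", "u3")
```

Here u1 and u2 rank their common products in the same way, so their similarity is 1, while u1 and u3 have opposite tastes and get a negative value. Products liked by u2 but not yet rated by u1 would then be candidates for recommendation to u1.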
Principal Component Analysis

Theoretically, correlation works well for variables with a Gaussian
distribution, in other words, independent variables. For other scenarios,
you have to use principal component analysis. Suppose you want to
reduce the dimension of N variables: X1, X2, ... XN. Let's form a matrix
of NxN dimension where the i-th column represents the observations of Xi,
assuming all variables have N observations. Now if k variables
are redundant, for simplicity you assume k columns are the same as, or
linear combinations of, other columns. Then the rank of the matrix will be N-k.
So, the rank of this matrix is a measure of the number of independent
variables, and the eigenvalue indicates the strength of that variable. This
concept is used in principal component analysis and factor analysis. To
make the matrix square, a covariance matrix is chosen. Singular value
decomposition is used to find the eigenvalues.

Let Y be the input matrix of size p x q, where p is the number of data
rows and q is the number of parameters.

Then, assuming each column of Y has been centered to zero mean, the
q x q covariance matrix Co is given by the following:

$$ Co = \frac{Y^{T} Y}{p - 1} $$

It is a symmetric matrix, so it can be diagonalized as follows:

$$ Co = U D U^{T} $$

Each column of U is an eigenvector, and D is a diagonal matrix with
eigenvalues $\lambda_i$ in decreasing order on the diagonal. The eigenvectors
are referred to as principal axes or principal directions of the data.
Projections of the data on the principal axes are called principal
components, also known as PC scores; these can be seen as new, transformed
variables. The j-th principal component is given by the j-th column of YU. The
coordinates of the i-th data point in the new PC space are given by the i-th
row of YU.

The singular value decomposition algorithm is used to find D and U.
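
The decomposition can be sketched with NumPy. The data matrix is made up, and the covariance is normalized by the number of rows minus one, the usual sample-covariance convention, after centering each column.

```python
import numpy as np

# Made-up data: p = 6 rows (observations), q = 3 columns (parameters).
Y = np.array([
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.8],
    [6.0, 12.0, 1.2],
    [7.0, 13.8, 0.9],
])
Y = Y - Y.mean(axis=0)            # center each column
p, q = Y.shape

Co = Y.T.dot(Y) / (p - 1)         # q x q covariance matrix

# Eigen-decomposition of the symmetric matrix, sorted by decreasing eigenvalue.
eigvals, U = np.linalg.eigh(Co)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]
D = np.diag(eigvals)

# The same eigenvalues obtained from the singular values of Y.
singular_values = np.linalg.svd(Y, compute_uv=False)
eigvals_from_svd = singular_values ** 2 / (p - 1)

scores = Y.dot(U)                 # principal components (PC scores)
```

The first two columns of this sample are almost proportional, so nearly all of the variance ends up in the first eigenvalue; this is the redundancy that the rank argument above describes.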


The following code is an example of factor analysis in Python. 
Here is the input data: 


[Table: survey responses. Columns: Government Policy Changes,
Competitors' Strategic Decisions, Competition, Supplier Relation,
Customer Feedback, Technology Innovations. Each row holds one
respondent's answers on the scale Strongly Agree, Agree, Somewhat Agree,
Somewhat Disagree, Disagree.]



Before running the code, you have to enter a numeric value for categorical data, for
example: Strongly Agree = 5, Agree = 4, Somewhat Agree = 3.
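
One way to do this encoding is with a dictionary and Pandas. The codes 5, 4, and 3 come from the text; the codes for the two lower answers are an assumption that continues the same pattern, and the sample responses are made up.

```python
import pandas as pd

# Assumed scale: 5, 4, and 3 are given in the text; 2 and 1 are a guess
# that continues the same pattern.
scale = {
    "Strongly Agree": 5,
    "Agree": 4,
    "Somewhat Agree": 3,
    "Somewhat Disagree": 2,
    "Disagree": 1,
}

# Made-up responses using two of the survey's column names.
raw = pd.DataFrame({
    "Competition": ["Agree", "Somewhat Disagree", "Strongly Agree"],
    "Customer Feedback": ["Somewhat Agree", "Disagree", "Agree"],
})

# Replace every label with its numeric code, column by column.
data = raw.apply(lambda col: col.map(scale))
```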
import pandas as pd
data = pd.read_csv('<input csvfile>')

from sklearn.decomposition import FactorAnalysis
factors = FactorAnalysis(n_components=6).fit(data)
print(factors.components_)

from sklearn.decomposition import PCA
pcomponents = PCA(n_components=6).fit(data)
print(pcomponents.components_)

Mutual Information 

Mutual information (MI) of two random variables is a measure of the
mutual dependence between the two variables. It is also used as a
similarity measure of the distribution of two variables. A higher value of
mutual information indicates the distribution of two variables is similar.

$$ I(X;Y) = \sum_{y}\sum_{x} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} $$

Suppose a retailer wants to investigate whether a particular city
is a deciding factor for its sales volume. Then the retailer can see the
distribution of sales volume across the different cities. If the distribution is
the same for all cities, then a particular city is not an important factor as far
as sales volume is concerned. To calculate the difference between the two
probability distributions, mutual information is applied here.

Here is the sample Python code to calculate mutual information:

import numpy as np
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    # joint histogram (contingency table) of the two variables
    c_xy = np.histogram2d(x, y, bins)[0]
    # G statistic of the log-likelihood ratio test
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi
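
To connect the formula with the retailer example, the following sketch evaluates the mutual information directly from a table of joint counts. The function name and both count tables are made up for illustration; rows stand for cities and columns for sales-volume bands.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in nats) from a table of joint counts."""
    counts = np.asarray(counts, dtype=float)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of the column variable
    nonzero = p_xy > 0                      # 0 * log(0) is taken as 0
    ratio = p_xy[nonzero] / (p_x * p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))

# Made-up tables: rows are cities, columns are sales-volume bands.
same_everywhere = [[10, 20], [10, 20]]   # every city has the same distribution
city_decides = [[30, 0], [0, 30]]        # the city fully determines the band

mi_independent = mutual_information(same_everywhere)
mi_dependent = mutual_information(city_decides)
```

When every city shows the same distribution, the mutual information is zero, so the city is not a deciding factor; when the city fully determines the sales band, it reaches log 2 for this two-by-two table.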


Classifications with Python 

Classification is a well-accepted example of machine learning. It has a
set of target classes and training data. Each training data point is
labeled with a particular target class. The classification model is
trained on the training data and predicts the target class of test data.
One common application of classification is fraud identification in the
credit card or loan approval process. It classifies the applicant as
fraud or nonfraud based on data. Classification is also widely used in
image recognition. If you recognize the images of a computer in a set of
images, you are classifying each image into the computer class or the
not-a-computer class.

Sentiment analysis is a popular application of text classification. 
Suppose an airline company wants to analyze its customer textual 
feedback. Then each feedback is classified according to sentiment 
(positive/negative/neutral) and also according to context (about staff/ 
timing/food/price). Once this is done, the airline can easily find out 
what the strength of that airline's staff is or its level of punctuality or cost 
effectiveness or even its weakness. Broadly, there are three approaches in 
classification. 

• Rule-based approach: I will discuss the decision tree 
and random forest algorithm. 

• Probabilistic approach: I will discuss the Naive Bayes 
algorithm. 


• Distance-based approach: I will discuss the support
vector machine.




Semisupervised Learning 

Classification and regression are types of supervised learning. In this type
of learning, you have a set of training data where you train your model.
Then the model is used to predict test data. For example, suppose you
want to classify text according to sentiment. There are three target classes:
positive, negative, and neutral. To train your model, you have to choose
some sample text and label it as positive, negative, or neutral. You use
this training data to train the model. Once your model is trained, you can
apply your model to test data. For example, you may use the Naive Bayes
classifier for text classification and try to predict the sentiment of the
sentence "Food is good." In the training phase, the program will calculate
the probability of a sentence being positive or negative or neutral when
the words Food, is, and good are presented separately and store these
probabilities in the model, and in the test phase it will calculate the
joint probability when Food, is, and good all come together. Conversely,
clustering is an example of unsupervised learning where there is no
training data or target class available. The program learns from data in
one shot. There is also semisupervised learning. Suppose you are
classifying the text as positive and negative sentiments but your
training data has only positives. The training data that is not positive
is unlabeled. In this case, as the first step, you train the model
assuming all unlabeled data is negative and apply the trained model on
the training data. In the output, the data coming in as negative should
be labeled as negative. Finally, train your model with the newly labeled
data. The nearest neighbor classifier is also considered as
semisupervised learning. It has training data, but it does not have the
training phase of the model.
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Decision Tree 


A decision tree is a tree of rules. Each level of the tree represents a 
parameter, each node of the level validates a constraint for that level's 
parameter, and each branch indicates a possible value of the parent node's 
parameter. Figure 3-1 shows an example of a decision tree. 



Figure 3-1. Example of decision tree for good weather 

Which Attribute Comes First? 

One important aspect of the decision tree is deciding the order of the 
features. The entropy-based information gain measure decides it. 

Entropy is a measure of the randomness of a system. For a collection S 
whose examples fall into c classes, where p_i is the proportion of examples 
in class i, it is defined as follows: 

$$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$ 

For example, for a certain event such as the sun rising in the east, 
entropy is zero because p = 1 and log(p) = 0. More entropy means more 
uncertainty or randomness in the system. 

Information gain, which is the expected reduction in entropy caused 
by partitioning the examples according to an attribute, is the measure 
used in this case. 

Specifically, the information gain, Gain(S, A), of an attribute A relative 
to a collection of examples S is defined as follows: 

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$ 

Here S_v is the subset of S for which attribute A has the value v. 

So, an attribute with a higher information gain will come first in the 
decision tree. 
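
To make the two formulas concrete, here is a minimal sketch in plain Python. The toy weather records and the column names (outlook, windy, play) are assumptions made up for illustration; they are not data from this book. 

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    gain = entropy([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Hypothetical toy data: is the weather good enough to play?
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
first_attribute = max(gains, key=gains.get)
print(gains, first_attribute)
```

In this toy data, outlook separates the classes completely while windy tells you nothing, so outlook would be placed first in the tree. 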

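import pandas as pd 
from sklearn.tree import DecisionTreeClassifier 

# The file path and column names are placeholders for your own data 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target class column'] 
X = df[['col1', 'col2']] 

clf = DecisionTreeClassifier() 
clf.fit(X, y) 

# X_test holds the records to classify, with the same columns as X 
clf.predict(X_test) 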


Random Forest Classifier 

A random forest classifier is an extension of a decision tree in which the 
algorithm creates N decision trees, where each tree uses M features 
selected randomly. A test record is then classified by all the decision 
trees and assigned to the target class that is the output of the majority 
of the trees. 

import pandas as pd 
from sklearn.ensemble import RandomForestClassifier 

# The file path and column names are placeholders for your own data 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target class column'] 
X = df[['col1', 'col2']] 

clf = RandomForestClassifier(n_jobs=2, random_state=0) 
clf.fit(X, y) 

# X_test holds the records to classify, with the same columns as X 
clf.predict(X_test) 
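
The voting step can be sketched in plain Python. This is only an illustration of how the individual tree outputs are combined, not how scikit-learn implements it internally; the five tree predictions are made-up values. 

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the winning class and the share of trees that voted for it."""
    votes = Counter(tree_predictions)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(tree_predictions)

# Hypothetical outputs of N = 5 decision trees for one test record
label, share = majority_vote(["A", "B", "A", "A", "B"])
print(label, share)
```

The share of trees voting for the winning class can also be read as a rough probability of that class. 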

Naive Bayes Classifier 

X = (x1, x2, x3, ..., xn) is a vector of n dimensions. The Bayesian 
classifier assigns each X to one of the target classes of the set 
{C1, C2, ..., Cm}. This assignment is done on the basis of the probability 
that X belongs to target class Ci. That is to say, X is assigned to class Ci 
if and only if P(Ci|X) > P(Cj|X) for all j such that 1 ≤ j ≤ m and j ≠ i. 

By Bayes' theorem, P(Ci|X) is proportional to P(X|Ci) P(Ci). In general, 
it can be computationally costly to compute P(X|Ci). If each component 
of X can have one of r values, there are r^n combinations to consider for 
each of the m classes. To simplify the calculation, the assumption of class 
conditional independence is made, which means that for each class, the 
attributes are assumed to be independent. The classifier developed from 
this assumption is known as the Naive Bayes classifier. The assumption 
allows you to write the following: 

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$ 

The following code is an example of the Naive Bayes classification of 
numerical data: 

# Import the Gaussian Naive Bayes model 
from sklearn.naive_bayes import GaussianNB 
import pandas as pd 

# Assign predictor and target variables (placeholders for your own data) 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target class column'] 
X = df[['col1', 'col2']] 

# Create a Gaussian classifier 
model = GaussianNB() 

# Train the model using the training set 
model.fit(X, y) 

# Predict the output; input_array is a placeholder for one record 
print(model.predict([input_array])) 


Note You’ll see another example of the Naive Bayes classifier in the 
“Sentiment Analysis” section. 
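
To see the independence assumption at work without any library, here is a minimal sketch for categorical features. The four training records are invented for illustration, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero. 

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature position) -> value counts
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts

def posterior_scores(row, class_counts, value_counts):
    """P(Ci) * prod_k P(x_k | Ci) for every class (unnormalized posteriors)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for k, value in enumerate(row):
            score *= value_counts[(label, k)][value] / count
        scores[label] = score
    return scores

# Hypothetical training data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["yes", "yes", "no", "no"]
model = train(rows, labels)
scores = posterior_scores(("sunny", "mild"), *model)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The record is assigned to the class with the largest score, which corresponds to the rule P(Ci|X) > P(Cj|X). 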


Support Vector Machine 

If you look at Figure 3-2, you can easily see that the circle and 
square points are linearly separable in two dimensions (x1, x2). But they 
are not linearly separable in either dimension x1 or x2 alone. The support 
vector machine algorithm works on this idea. It increases the dimension of 
the data until the points are linearly separable. Once that is done, you have 
to find two parallel hyperplanes that separate the data. These planes are 
known as the margins. The algorithm chooses the margins in such a way that 
the distance between them is the maximum. That's why it is called the 
maximum margin. The plane that lies midway between these two margins, 
at an equal distance from each, is known as the optimal hyperplane and is 
used to classify the test data (see Figure 3-2). The separator can also be 
nonlinear. 



Figure 3-2. Support vector machine 
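
The idea of raising the dimension until the classes become separable can be illustrated with a tiny sketch. It does not train a support vector machine; it only shows, on made-up one-dimensional points, that adding x squared as a second feature makes a single threshold sufficient. 

```python
def lift(x):
    """Map a 1-D point to 2-D by adding its square as a second feature."""
    return (x, x * x)

def separable_by_threshold(values_a, values_b):
    """True if one threshold puts all of values_a on one side and values_b on the other."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)

# Hypothetical data: circles sit between the squares on the x axis
circles = [-1.0, 0.0, 1.0]
squares = [-3.0, 3.0]

in_1d = separable_by_threshold(circles, squares)
in_2d = separable_by_threshold([lift(x)[1] for x in circles],
                               [lift(x)[1] for x in squares])
print(in_1d, in_2d)
```

In one dimension no threshold separates the two groups, but along the added coordinate any threshold between 1 and 9 does. 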

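The following code is an example of doing support vector machine 
classification using Python: 

import pandas as pd 
from sklearn import svm 

# The file path and column names are placeholders for your own data 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target class column'] 
X = df[['col1', 'col2']] 

# The line that creates the model is missing from the original listing; 
# svm.SVC() with default settings is assumed here 
model = svm.SVC() 
model.fit(X, y) 
model.score(X, y) 

# x_test holds the records to classify, with the same columns as X 
print(model.predict(x_test)) 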

Nearest Neighbor Classifier 

The nearest neighbor classifier is a simple distance-based classifier. It 
calculates the distance of test data from the training data and groups 
the distances according to the target class. The target class, which has a 
minimum average distance from the test instance, is selected as the class 
of the test data. A Python example is shown here: 

import math 
import operator 

def Distance(point1, point2, length): 
    # Euclidean distance over the first `length` coordinates 
    distance = 0 
    for x in range(length): 
        distance += pow((point1[x] - point2[x]), 2) 
    return math.sqrt(distance) 

def getClosePoints(trainingData, testData, k): 
    # Each training row holds the coordinates followed by its class label; 
    # testData holds coordinates only, so all of them are used 
    distances = [] 
    length = len(testData) 
    for x in range(len(trainingData)): 
        dist = Distance(testData, trainingData[x], length) 
        distances.append((trainingData[x], dist)) 
    distances.sort(key=operator.itemgetter(1)) 
    close = [] 
    for x in range(k): 
        close.append(distances[x][0]) 
    return close 

trainData = [[3, 3, 3, 'A'], [5, 5, 5, 'B']] 
testData = [7, 7, 7] 
k = 1 

neighbors = getClosePoints(trainData, testData, k) 
print(neighbors) 

Sentiment Analysis 

Sentiment analysis is an interesting application of text classification. For 
example, say an airline client wants to analyze its customer feedback. 
It classifies the feedback according to sentiment (positive/negative) and 
also by aspect (food/staff/punctuality). After that, it can easily understand 
its strengths in business (the aspect that has the maximum positive 
feedback) and its weaknesses (the aspect that has the maximum negative 
feedback). The airline can also compare this result with its competitors. 

One interesting advantage of doing a comparison with competitors is 
that it nullifies the impact of the accuracy of the model because the same 
accuracy applies to all competitors. This is simple to implement in 
Python using the textblob library, as shown here: 

from textblob.classifiers import NaiveBayesClassifier 

train = [('I love this sandwich.', 'pos'), 
         ('this is an amazing place!', 'pos'), 
         ('I feel very good about these beers.', 'pos'), 
         ('this is my best work.', 'pos'), 
         ("what an awesome view", 'pos'), 
         ('I do not like this restaurant', 'neg'), 
         ('I am tired of this stuff.', 'neg'), 
         ("I can't deal with this", 'neg'), 
         ('he is my sworn enemy!', 'neg'), 
         ('my boss is horrible.', 'neg')] 

cl = NaiveBayesClassifier(train) 

print(cl.classify("This is an amazing library!")) 

Output: pos 

from textblob.classifiers import NaiveBayesClassifier 

train = [('Air India did a poor job of queue management both times.', 
          'staff service'), 
         ("The 'cleaning' by flight attendants involved regularly " 
          "spraying air freshener in the lavatories.", 'staff'), 
         ('The food tasted decent.', 'food'), 
         ('Flew Air India direct from New York to Delhi round trip.', 
          'route'), 
         ('Colombo to Moscow via Delhi.', 'route'), 
         ('Flew Birmingham to Delhi with Air India.', 'route'), 
         ('Without toilet, food or anything!', 'food'), 
         ('Cabin crew announcements included a sincere apology for ' 
          'the delay.', 'cabin flown')] 

cl = NaiveBayesClassifier(train) 

tests = ['Food is good.'] 
for c in tests: 
    print(cl.classify(c)) 

Output: food 

The textblob library also supports a decision tree classifier, which 
works best on text written in proper English, such as a formal letter. 
For text that is not usually written with proper grammar, such as 
customer feedback, Naive Bayes works better. Naive Bayes has another 
advantage in real-time analytics: you can update the model without losing 
the previous training. 

Image Recognition 

Image recognition is a common example of image classification. It is easy 
to use in Python by applying the opencv library. Here is the sample code: 

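import cv2 

# cascPath and imagePath are placeholders for the Haar cascade XML file 
# and the image to analyze 
faceCascade = cv2.CascadeClassifier(cascPath) 
image = cv2.imread(imagePath) 
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) 

faces = faceCascade.detectMultiScale( 
    gray, 
    scaleFactor=1.1, 
    minNeighbors=5, 
    minSize=(30, 30), 
    flags=cv2.CASCADE_SCALE_IMAGE  # cv2.cv.CV_HAAR_SCALE_IMAGE in OpenCV 2 
) 

print("Found {0} faces!".format(len(faces))) 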

Regression with Python 

Regression realizes a variable as a linear or nonlinear polynomial of a set of 
independent variables. 

Here is an interesting use case: what is the sales price of a product that 
maximizes its profit? This is a million-dollar question for any merchant. 

The question is not straightforward. Maximizing the sales price may 
not result in maximizing the profit because increasing the sales price 
sometimes decreases the sales volume, which decreases the total profit. 
So, there will be an optimal value of the sales price for which the profit 
will be at the maximum. 

There are N records of transactions with M features called 
F1, F2, ..., Fm (sales price, buy price, cash back, SKU, order date, and so 
on). You have to find a subset of K (K < M) features that have an 
impact on the profit of the merchant and suggest optimal values 
V1, V2, ..., Vk for these K features that maximize the profit. 

You can calculate the profit of the merchant using the following formula: 

Profit = (SP - TCB - BP) * SV (1) 

For this formula, here are the variables: 

• SP = Sales price 

• TCB = Total cash back 

• BP = Buy price 

• SV = Sales volume 

Now using regression, you can realize SV as follows: 

SV = a + b*SP + c*TCB + d*BP 

Now you can express profit as a function of SP, TCB, and BP and use 
mathematical optimization. By placing constraints on all the parameter 
values, you can get the optimal values of the parameters that maximize 
the profit. 

This is an interesting use case of regression. There are many scenarios 
where one variable has to be realized as a linear or polynomial function 
of other variables. 
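
As a rough sketch of the optimization step, the following code searches a constrained range of sales prices for the one that maximizes formula (1). The regression coefficients a, b, c, and d, the fixed values of TCB and BP, and the price range are all made-up assumptions; in practice the coefficients would come from a fitted regression. 

```python
def sales_volume(sp, tcb, bp, a=1000.0, b=-2.0, c=5.0, d=0.0):
    """Hypothetical fitted model: SV = a + b*SP + c*TCB + d*BP."""
    return a + b * sp + c * tcb + d * bp

def profit(sp, tcb, bp):
    """Formula (1): Profit = (SP - TCB - BP) * SV."""
    return (sp - tcb - bp) * sales_volume(sp, tcb, bp)

tcb, bp = 10.0, 100.0

# Constraint: sales price between break-even (110.0) and 500.0, in steps of 0.5
candidates = [tcb + bp + 0.5 * step for step in range(781)]
best_sp = max(candidates, key=lambda sp: profit(sp, tcb, bp))
print(best_sp, profit(best_sp, tcb, bp))
```

A grid search is used here only because it is easy to read. With a linear SV the profit is a quadratic in SP, so the same optimum can be found analytically or with a numerical optimizer. 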

Least Square Estimation 

Least square estimation is the simplest and oldest method for doing 
regression. It is also known as the curve fitting method. Ordinary least 
squares (OLS) regression is the most common technique and was developed 
around the turn of the 19th century by Carl Friedrich Gauss and 
Adrien-Marie Legendre. The following is a derivation of the coefficient 
calculation in ordinary least square estimation, where X' denotes the 
transpose of X and the residual ε is orthogonal to the columns of X: 

$$Y = X\beta + \epsilon$$ 
$$X'Y = X'X\beta + X'\epsilon$$ 
$$X'Y = X'X\beta + 0$$ 
$$(X'X)^{-1}X'Y = \beta + 0$$ 
$$\beta = (X'X)^{-1}X'Y$$ 

The following code is a simple example of OLS regression: 

import pandas as pd 
from scipy import stats 

# The file path and column names are placeholders for your own data 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target column'] 

# linregress fits a single predictor and estimates the intercept itself, 
# so no constant column is added 
x = df['col1'] 

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y) 
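
For a single predictor, the closed-form OLS solution reduces to two short formulas for the slope and the intercept. The following sketch computes them in plain Python on made-up data points, so you can see what the library call is estimating. 

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations that lie roughly on a straight line
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
intercept, slope = ols_fit(xs, ys)
print(intercept, slope)
```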

Logistic Regression 

Logistic regression is an interesting application of regression that 
calculates the probability of an event. It introduces an intermediate 
variable that is a linear combination of the input variables. Then it passes 
the intermediate variable through the logistic function, which maps 
the intermediate variable to a value between zero and one. This value is 
treated as the probability of the event. 
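
The mapping itself takes only a few lines. In this sketch the weights and the intercept are arbitrary illustrative numbers, not fitted coefficients. 

```python
import math

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def event_probability(features, weights, intercept):
    """Linear combination of the inputs passed through the logistic function."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical coefficients chosen so that the intermediate variable is zero
p = event_probability([2.0, 1.0], weights=[0.8, -0.5], intercept=-1.1)
print(p)
```

An intermediate value of zero corresponds to a probability of 0.5; large positive values approach one and large negative values approach zero. 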

The following code is an example of doing logistic regression on 
numerical data: 

import pandas as pd 
import statsmodels.api as sm 

# The file path and column names are placeholders for your own data 
df = pd.read_csv('csv file path', index_col=0) 
y = df['target column'] 
X = df[['col1', 'col2']] 

X = sm.add_constant(X) 
logit = sm.Logit(y, X) 
result = logit.fit() 
print(result.summary()) 

Classification and Regression 

Classification and regression may be applied to the same problem. For 
example, when a bank approves a credit card or loan, it calculates a credit 
score for a candidate based on multiple parameters using regression and 
then sets up a threshold. Candidates having a credit score greater than 
the threshold are classified as potential nonfraud, and the remaining are 
considered potential fraud. Likewise, any binary classification problem 
can be solved using regression by applying a threshold to the regression 
result. In Chapter 4, I discuss in detail how to choose a threshold 
value from the distribution of any parameter. Similarly, some binary 
classifiers can be used in place of logistic regression. For instance, 
say an e-commerce site wants to predict the chances of a purchase order 
being converted into a successful invoice. The site can easily do so using 
logistic regression. The Naive Bayes classifier can also be applied directly 
to the problem because it calculates a probability when it classifies the 
purchase order into the successful or unsuccessful class. The random 
forest algorithm can be used for the problem as well. In that case, if M 
of the N decision trees say the purchase order will be successful, then 
M/N is the probability of the purchase order being successful. 

Intentionally Bias the Model to Over-Fit or Under-Fit 


Sometimes you need to over- or under-predict intentionally. In an 
auction, when you are predicting from the buy side, it will always be good 
if your bid is a little lower than the original. Similarly, on the sell side, it 
is desirable to set the price a little higher than the original. You can 
do this in two ways. In regression, when you are selecting the features 
using correlation, you can over-predict intentionally by dropping some 
variables with negative correlation. Similarly, you can under-predict by 
dropping some variables with positive correlation. There is another way 
of dealing with this. When you are predicting the value, you can also 
predict the error in the prediction. To under-predict, when you see that 
the predicted error is positive, reduce the prediction value by the error 
amount. Similarly, to over-predict, increase the prediction value by the 
error amount when the error is positive. 

Another problem in classification is biased training data. Suppose 
you have two target classes, A and B. The majority (say 90 percent) of 
the training data is class A. So, when you train your model with this data, 
nearly all your predictions will be class A. One solution is a biased 
sampling of the training data: intentionally remove some class A examples 
from training. Another approach can be used for binary classification. 
Because class B is a minority, the predicted probability of a sample being 
in class B will almost always be less than 0.5. So, calculate the average 
class B probability over all points. For any point, if the class B probability 
is greater than the average probability, then mark it as class B and 
otherwise as class A. 
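
The average-probability rule can be sketched in a few lines. The class B probabilities below are invented to mimic the output of a model trained on imbalanced data. 

```python
def classify_with_mean_threshold(prob_b):
    """Label a sample B when its class B probability exceeds the average."""
    threshold = sum(prob_b) / len(prob_b)
    labels = ["B" if p > threshold else "A" for p in prob_b]
    return labels, threshold

# Hypothetical class B probabilities; none of them reaches 0.5
prob_b = [0.05, 0.10, 0.40, 0.08, 0.35, 0.02]
labels, threshold = classify_with_mean_threshold(prob_b)
print(labels, threshold)
```

With the usual 0.5 cutoff every sample here would be labeled A; the average-based threshold recovers the two samples that lean toward class B. 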
The following code is an example of tuning the classification condition: 

# predict, d1, d2, and the iteration counter i are defined elsewhere 
y_test_increasing, predicted_increasing = predict(d1, True, False) 
y_test_decreasing, predicted_decreasing = predict(d2, False, False) 

prob_increasing = predicted_increasing[:, 1] 
increasing_mean = prob_increasing.mean() 
increasing_std = prob_increasing.std() 
prob_decreasing = predicted_decreasing[:, 0] 
decreasing_mean = prob_decreasing.mean() 
decreasing_std = prob_decreasing.std() 

# Keep a running average of the standard deviations across iterations 
if i > 0: 
    mean_std_inc = (mean_std_inc + increasing_std) / 2 
    mean_std_dec = (mean_std_dec + decreasing_std) / 2 
else: 
    mean_std_inc = increasing_std 
    mean_std_dec = decreasing_std 

for j in range(len(y_test_decreasing) - 1): 
    ac_status = y_test_increasing[j] + y_test_decreasing[j] 
    pr_status = 0 
    if True: 
        # Shift each probability by its mean and the averaged deviation 
        inc = (prob_increasing[j] - increasing_mean + mean_std_inc) 
        dec = (prob_decreasing[j] - decreasing_mean + mean_std_dec) 
        if inc > 0 or dec > 0: 
            if inc > dec: 
                pr_status = 1 
            else: 
                pr_status = -1 
        else: 
            pr_status = 0 

Dealing with Categorical Data 

For algorithms such as the support vector machine or regression, input 
data must be numeric. So, if you are dealing with categorical data, you 
need to convert it to numeric data. One strategy for conversion is to use 
an ordinal number as the numerical score. A more sophisticated way to do 
this is to use the expected value of the target variable for that categorical 
value. This is good for regression. 

# Replace each categorical value with the mean of the target ('floor') 
# observed for that value 
for col in X.columns: 
    avgs = df.groupby(col, as_index=False)['floor'].aggregate(np.mean) 
    for i, row in avgs.iterrows(): 
        k = row[col] 
        v = row['floor'] 
        X.loc[X[col] == k, col] = v 

For logistic regression, you can use the expected probability of the 
target variable for that categorical value. 

for col in X.columns: 
    if str(col) != 'success': 
        if str(col) not in index: 
            # P(feature value) 
            feature_prob = X.groupby(col).size().div(len(df)) 
            # P(success, feature value) / P(feature value) 
            cond_prob = X.groupby(['success', str(col)]).size().div( 
                len(df)).div(feature_prob, axis=0, 
                level=str(col)).reset_index(name="Probability") 
            cond_prob = cond_prob[cond_prob.success != '0'] 
            cond_prob.drop("success", inplace=True, axis=1) 
            cond_prob['feature_value'] = cond_prob[str(col)].apply(str).values 
            cond_prob.drop(str(col), inplace=True, axis=1) 
            # Replace each category with its conditional probability 
            for i, row in cond_prob.iterrows(): 
                k = row["feature_value"] 
                v = row["Probability"] 
                X.loc[X[col] == k, col] = v 
        else: 
            X.drop(str(col), inplace=True, axis=1) 

The following example shows how to deal with categorical data and 
how to use correlation to select features. The following is the complete 
code of the data preprocessing. The data for this code example is also 
available online at the Apress website. 

import json 
import numpy as np 
import pandas as pd 
from scipy.stats import pearsonr 
from sklearn.model_selection import train_test_split 

def process_real_time_data(time_limit): 
    # <input> is a placeholder for the raw JSON payload; get_hour is a 
    # helper defined elsewhere 
    df = pd.read_json(json.loads(<input>)) 
    df.replace(r'^\s+', '', regex=True, inplace=True)  # front 
    df.replace(r'\s+$', '', regex=True, inplace=True)  # end 
    time_limit = df['server_time'].max() 
    df['floor'] = pd.to_numeric(df['floor'], errors='ignore') 
    df['client_time'] = pd.to_numeric(df['client_time'], errors='ignore') 
    df['client_time'] = df.apply( 
        lambda row: get_hour(row.client_time), axis=1) 

    y = df['floor'] 
    X = df[['ip', 'size', 'domain', 'client_time', 'device', 
            'ad_position', 'client_size', 'root']] 
    X_back = df[['ip', 'size', 'domain', 'client_time', 'device', 
                 'ad_position', 'client_size', 'root']] 

    # Replace each categorical value with the mean floor seen for it 
    for col in X.columns: 
        avgs = df.groupby(col, as_index=False)['floor'].aggregate(np.mean) 
        for index, row in avgs.iterrows(): 
            k = row[col] 
            v = row['floor'] 
            X.loc[X[col] == k, col] = v 

    X.drop('ip', inplace=True, axis=1) 
    X_back.drop('ip', inplace=True, axis=1) 

    # test_size=0 keeps every row for training, as in the original listing; 
    # recent scikit-learn releases may reject a zero test size, in which 
    # case use X and y directly 
    X_train, X_test, y_train, y_test = train_test_split( 
        X, y, test_size=0, random_state=42) 

    X_train = X_train.astype(float) 
    y_train = y_train.astype(float) 

    X_train = np.log(X_train + 1) 
    y_train = np.log(y_train + 1) 

    X_train = X_train.values 
    y_train = y_train.values 

    # Drop every feature whose correlation with the target is below 
    # its probable error 
    index = [] 
    i1 = 0 
    processed = 0 
    while True: 
        flag = True 
        for i in range(X_train.shape[1]): 
            if i > processed: 
                i1 = i1 + 1 
                corr = pearsonr(X_train[:, i], y_train) 
                PEr = .674 * (1 - corr[0] * corr[0]) / ( 
                    len(X_train[:, i]) ** (1 / 2.0)) 
                if corr[0] < PEr: 
                    X_train = np.delete(X_train, i, 1) 
                    index.append(X.columns[i1 - 1]) 
                    processed = i - 1 
                    flag = False 
                    break 
        if flag: 
            break 

    return y_train, X_train, y, X_back, X, time_limit, index 

Unsupervised Learning: Clustering 

In Chapter 3 we discussed how training data can be used to categorize 
customer comments according to sentiment (positive, negative, neutral), 
as well as according to context. For example, in the airline domain, the 
context can be punctuality, food, comfort, entertainment, and so on. Using 
this analysis, a business owner can determine the areas of the business 
that need attention. For instance, if the owner observes that the highest 
percentage of negative comments has been about food, then the priority 
will be the quality of food being served to the customers. However, there 
are scenarios where business owners are not sure about the context. There 
are also cases where training data is not available. Moreover, the frame 
of reference can change with time. Classification algorithms cannot be 
applied where target classes are unknown. A clustering algorithm is used 
in these kinds of situations. A conventional application of clustering is 
found in the wine-making industry, where each cluster represents a brand 
of wine, and wines are clustered according to the ratio of their 
components. In Chapter 3, you learned that classification can be used to 
recognize a type of image, but there are scenarios where one image has 
multiple shapes and an algorithm is needed to separate the figures. 
Clustering algorithms are used in this kind of use case. 


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python, 
https://doi.org/10.1007/978-1-4842-3450-1_4 

Clustering classifies objects into groups based on a similarity or 
distance measure. This is an example of unsupervised learning. The main 
difference between clustering and classification is that the latter has 
well-defined target classes. The characteristics of target classes are 
defined by the training data and the models learned from it. That is why 
classification is supervised in nature. In contrast, clustering tries to define 
meaningful classes based on data and its similarity or distance. Figure 4-1 
illustrates a document clustering process. 



Figure 4-1. Document clustering 

K-Means Clustering 

Let's suppose that a retail distributor has an online system where local 
agents enter trading information manually. One of the fields they have 
to fill in is City. But because this data entry process is manual, people 
tend to make lots of spelling errors. For example, instead of 
Delhi, people enter Dehi, Dehli, Delly, and so on. You can try to solve 
this problem using clustering because the number of clusters is already 
known; in other words, the retailer is aware of how many cities the agents 
operate in. This is an example of K-means clustering. 

The K-means algorithm has two inputs. The first one is the data X, which 
is a set of N vectors, and the second one is K, which represents 
the number of clusters that need to be created. The output is a set of K 
centroids, one for each cluster, as well as a label for each vector in X that 
indicates the cluster to which the point is assigned. All points within a 
cluster are nearer in distance to their centroid than to any other centroid. 
The condition for the K clusters C_k and the K centroids μ_k can be 
expressed as follows: 

$$\text{minimize} \sum_{k=1}^{K} \sum_{x_n \in C_k} \|x_n - \mu_k\|^2 \text{ with respect to } C_k, \mu_k$$ 

However, no polynomial-time algorithm is known for solving this 
optimization problem exactly. But Lloyd has proposed an iterative method 
as a solution. It consists of two steps that iterate until the program 
converges to the solution. 

1. It has a set of K centroids, and each point is assigned 
to the unique cluster whose centroid is at the minimum 
distance from that particular point: 

$$C_k = \{x_n : \|x_n - \mu_k\| \le \|x_n - \mu_l\| \text{ for all } l\}$$ (1) 

2. It recalculates the centroid of each cluster by using 
the following formula: 

$$\mu_k = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n$$ (2) 

The two-step procedure continues until there is no further rearrangement 
of cluster points. The convergence of the algorithm is guaranteed, but it may 
converge to a local minimum. 
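
For numeric data, the two steps map directly onto code. The following is a minimal sketch for one-dimensional points; the data and the starting centroids are made up, and the function name lloyd_1d is only illustrative. 

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm for 1-D points; returns (centroids, clusters)."""
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(x - centroids[k]))
            clusters[nearest].append(x)
        # Step 2: move each centroid to the mean of its cluster
        updated = [sum(c) / len(c) if c else centroids[k]
                   for k, c in enumerate(clusters)]
        if updated == centroids:  # no further rearrangement
            break
        centroids = updated
    return centroids, clusters

# Hypothetical data with two obvious groups and a poor starting guess
centroids, clusters = lloyd_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 2.0])
print(centroids, clusters)
```

Even from a poor starting guess, the centroids settle on the means of the two groups after a few iterations. 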

The following is a simple implementation of Lloyd's algorithm for 
performing K-means clustering in Python. Because the data points are 
strings, the distance is the edit distance, and each center is stored as 
the index of a string in X: 

import random 

def ED(source, target): 
    # Recursive Levenshtein edit distance between two strings 
    if source == "": 
        return len(target) 
    if target == "": 
        return len(source) 
    if source[-1] == target[-1]: 
        cost = 0 
    else: 
        cost = 1 
    res = min([ED(source[:-1], target) + 1, 
               ED(source, target[:-1]) + 1, 
               ED(source[:-1], target[:-1]) + cost]) 
    return res 

def find_centre(x, X, mu): 
    # Return the index (into X) of the center nearest to the string x 
    min_dist = 100 
    cent = 0 
    for c in mu: 
        dist = ED(x, X[c]) 
        if dist < min_dist: 
            min_dist = dist 
            cent = c 
    return cent 

def cluster_arrange(X, cent): 
    clusters = {} 
    for x in X: 
        bestcent = find_centre(x, X, cent) 
        try: 
            clusters[bestcent].append(x) 
        except KeyError: 
            clusters[bestcent] = [x] 
    return clusters 

def rearrange_centers(X, clusters): 
    # The body of this function is truncated in the original listing. 
    # Assumption: the new center of a cluster is the member with the 
    # smallest total edit distance to the other members. 
    newcent = [] 
    keys = sorted(clusters.keys()) 
    for k in keys: 
        members = clusters[k] 
        best = min(members, 
                   key=lambda m: sum(ED(m, other) for other in members)) 
        newcent.append(X.index(best)) 
    return newcent 

def has_converged(cent, oldcent): 
    return sorted(cent) == sorted(oldcent) 

def locate_centers(X, K): 
    oldcent = [] 
    cent = random.sample(range(len(X)), K) 
    while not has_converged(cent, oldcent): 
        oldcent = cent 
        # Assign all points in X to clusters 
        clusters = cluster_arrange(X, cent) 
        # Reevaluate centers 
        cent = rearrange_centers(X, clusters) 
    return (cent, clusters) 

X = ['Delhi', 'Dehli', 'Delii', 'Kolkata', 'Kalkata', 'Kalkota'] 
print(locate_centers(X, 2)) 

However, K-means clustering has a limitation. For example, suppose 
all of your data points are located in eastern India. For K=4 clustering, 
the initial step is that you randomly choose centers in Delhi, Mumbai, 
Chennai, and Kolkata. All of your points lie in eastern India, so all 
the points are nearest to Kolkata and are always assigned to Kolkata. 
Therefore, the program will converge in one step. To avoid this problem, 
the algorithm is run multiple times and the results are averaged. 
Programmers can use various tricks to initialize the centroids in the first 
step. 

Choosing K: The Elbow Method 

There are certain cases where you have to determine the K in K-means 
clustering. For this purpose, you can use the elbow method, which 
treats the percentage of variance explained as a variable dependent on the 
number of clusters. You start with a small number of clusters and keep 
adding one cluster at a time. At first, each added cluster explains much 
more of the variance, but at some point adding another cluster doesn't 
make the modeling of the data much better. The number of clusters is 
chosen at this point, which is the elbow criterion. This "elbow" 
cannot always be unambiguously identified. The percentage of variance is 
realized as the ratio of the between-group variance of the individual 
clusters to the total variance. Assume that in the previous example, the 
retailer has four cities: Delhi, Kolkata, Mumbai, and Chennai. The 
programmer does not know that, so he does clustering with K=2 to K=9 
and plots the percentage of variance. He will get an elbow curve that 
clearly indicates K=4 is the right number for K. 
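
The percentage of variance can be computed directly from a clustering. In this sketch the clusterings for K = 1, 2, and 3 are written out by hand for six made-up one-dimensional points; in practice each would come from a K-means run. 

```python
def variance_explained(clusters):
    """Between-group sum of squares divided by the total sum of squares."""
    points = [x for cluster in clusters for x in cluster]
    grand_mean = sum(points) / len(points)
    total = sum((x - grand_mean) ** 2 for x in points)
    between = sum(len(c) * (sum(c) / len(c) - grand_mean) ** 2
                  for c in clusters)
    return between / total

# Hypothetical clusterings of the same six points for K = 1, 2, 3
by_k = {
    1: [[1.0, 2.0, 3.0, 10.0, 11.0, 12.0]],
    2: [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]],
    3: [[1.0, 2.0], [3.0], [10.0, 11.0, 12.0]],
}
ratios = {k: variance_explained(c) for k, c in by_k.items()}
print(ratios)
```

The ratio jumps sharply from K = 1 to K = 2 and barely improves at K = 3, so the elbow for this toy data is at K = 2. 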


Distance or Similarity Measure 

The measure of distance or similarity is one of the key factors of clustering. 
In this section, I will describe the different kinds of distance and similarity 
measures. Before that, I'll explain what distance actually means here. 

Properties 

The distances are measures that satisfy the following properties: 

• dist(x, y) = 0 if and only if x = y. 

• dist(x, y) > 0 when x ≠ y. 

• dist(x, y) = dist(y, x). 

• dist(x, y) + dist(y, z) >= dist(x, z) for all x, y, and z. 

General and Euclidean Distance 


The distance between the points p and q is the length of the line 
segment connecting them. This is called Euclidean distance. 

In Cartesian coordinates, if p = (p1, p2, ..., pn) and 
q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the 
distance (d) from p to q, or from q to p, is derived from the Pythagorean 
theorem, as shown here: 

$$d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$ 

The Euclidean vector is the position of a point in a Euclidean n-space. 
The magnitude of a vector is a measure. Its length is calculated by the 
following formula: 

$$\|p\| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2}$$ 

A vector has direction along with a distance. The distance between two 
points, p and q, may have a direction, and therefore, it may be represented 
by another vector, as shown here: 

$$q - p = (q_1 - p_1, q_2 - p_2, \ldots, q_n - p_n)$$ 

The Euclidean distance between p and q is simply the Euclidean length 
of this distance (or displacement) vector. 

In one dimension: 

$$\sqrt{(x - y)^2} = |x - y|$$ 

In two dimensions: 

In the Euclidean plane, if p = (p1, p2) and q = (q1, q2), then the distance 
is given by the following: 

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$ 

Alternatively, it follows from the equation that if the polar coordinates 
of the point p are (r1, θ1) and those of q are (r2, θ2), then the distance 
between the points is as follows: 

$$\sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos(\theta_1 - \theta_2)}$$ 

In n dimensions: 

In the general case, the distance is as follows: 

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_i - q_i)^2 + \cdots + (p_n - q_n)^2}$$ 

In Chapter 3, you will find an example of Euclidean distance in the 
nearest neighbor classifier example. 
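
As a quick check of these formulas, the following sketch computes the distance between the same two points once from Cartesian coordinates and once from polar coordinates. The points (3, 0) and (0, 4) are arbitrary illustrative values. 

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def polar_distance(r1, theta1, r2, theta2):
    """Distance between two points given in polar coordinates."""
    return math.sqrt(r1 ** 2 + r2 ** 2
                     - 2 * r1 * r2 * math.cos(theta1 - theta2))

# (3, 0) and (0, 4) in Cartesian form are (3, 0) and (4, pi/2) in polar form
d_cartesian = euclidean((3.0, 0.0), (0.0, 4.0))
d_polar = polar_distance(3.0, 0.0, 4.0, math.pi / 2)
print(d_cartesian, d_polar)
```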

Squared Euclidean Distance 

The standard Euclidean distance can be squared to place progressively 
greater weight on objects that are farther apart. In this case, the equation 
becomes the following: 

$$d^2(p, q) = (p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_i - q_i)^2 + \cdots + (p_n - q_n)^2$$ 

Squared Euclidean distance is not a metric because it does not satisfy 
the triangle inequality. However, it is frequently used in optimization 
problems in which distances only have to be compared. 
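
A small counterexample shows why the triangle inequality fails. The three points on a line below are chosen only for illustration. 

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

x, y, z = (0.0,), (1.0,), (2.0,)
direct = squared_euclidean(x, z)
via_y = squared_euclidean(x, y) + squared_euclidean(y, z)
violates_triangle = direct > via_y

# Squaring does not change which candidate is nearest to x
nearest = min([y, z], key=lambda p: squared_euclidean(x, p))
print(direct, via_y, violates_triangle, nearest)
```

Going directly from x to z costs more than going through y, which a true metric never allows, yet the ordering of the distances from x is preserved. 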
Distance Between Strings: Edit Distance 

Edit distance is a measure of dissimilarity between two strings. It counts 
the minimum number of operations required to make two strings 
identical. Edit distance finds applications in natural language processing, 
where automatic spelling corrections can indicate candidate corrections 
for a misspelled word. Edit distance is of two types. 

• Levenshtein edit distance 

• Needleman edit distance 

Levenshtein Distance 

The Levenshtein distance between two words is the least number of 
insertions, deletions, or replacements that need to be made to change one 
word into the other. Vladimir Levenshtein introduced this distance in 
1965. 

Levenshtein distance is also known as edit distance, although that term 
might denote a larger family of distance metrics as well. It is closely 
related to pair-wise string alignments. 

For example, the Levenshtein distance between Calcutta and Kolkata 
is 5, since the following five edits change one into the other: 

Calcutta → Kalcutta (substitution of K for C) 

Kalcutta → Kolcutta (substitution of o for a) 

Kolcutta → Kolkutta (substitution of k for c) 

Kolkutta → Kolkatta (substitution of a for u) 

Kolkatta → Kolkata (deletion of t) 

The Levenshtein distance has simple bounds. The lower bound is zero, 
which is reached when the strings are identical, and the upper bound is 
the length of the longer string. A code example of the Levenshtein 
distance is given in the K-means clustering code. 
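
The recursive version in the K-means code recomputes the same subproblems many times. As an alternative sketch, the distance can be filled in row by row with dynamic programming; the function name levenshtein is my own choice here. 

```python
def levenshtein(source, target):
    """Edit distance using a row-by-row dynamic-programming table."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein("Calcutta", "Kolkata"))  # 5, matching the edits listed above
```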
Needleman-Wunsch Algorithm 

The Needleman-Wunsch algorithm is used in bioinformatics to align 
protein or nucleotide sequences. It was one of the first applications of 
dynamic programming for comparing biological sequences. First it creates 
a matrix where the rows and columns correspond to the characters of the 
two sequences. Each cell of the matrix is a similarity score of the 
corresponding characters in that row and column. Scores are one of three 
types: matched, not matched, or matched with an insertion or deletion. 
Once the matrix is filled, the algorithm does a backtracing operation from 
the bottom-right cell to the top-left cell and finds the path where the 
neighbor score distance is the minimum. The sum of the scores on the 
backtracing path is the Needleman-Wunsch distance for the two strings. 

Pyopa is a Python module that provides a ready-made Needieman- 
Wunsch distance hetween two strings. 

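import pyopa 

data = {'gap_open': -20.56, 
        'gap_ext': -3.37, 
        'pam_distance': 150.87, 
        'scores': [[10.0]], 
        'column_order': 'A', 
        'threshold': 50.0} 

env = pyopa.create_environment(**data) 

s1 = pyopa.Sequence('AAA') 
s2 = pyopa.Sequence('TTT') 
print(pyopa.align_double(s1, s1, env)) 

# aligned_strings is not defined in the original listing. It is assumed 
# here to be the pair of aligned strings returned by pyopa.align_strings. 
aligned_strings = pyopa.align_strings(s1, s1, env) 
print(env.estimate_pam(aligned_strings[0], aligned_strings[1])) 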
Although Levenshtein is simple to implement and computationally 
less expensive, if you want to introduce a gap in string matching (for 
example, New Delhi and NewDelhi), then the Needleman-Wunsch 
algorithm is the better choice. 
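To make the matrix-filling and backtracing steps concrete, here is a small pure-Python sketch of Needleman-Wunsch global alignment. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration, and this common formulation maximizes a similarity score rather than minimizing a distance.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix cell by cell.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # match/mismatch
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Backtrace from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("New Delhi", "NewDelhi")
print(nw_score)
print(nw_a)
print(nw_b)
```

With these scores the space in "New Delhi" is aligned against a single gap, which is the behavior the paragraph above describes.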


Similarity in the Context of Document 

A similarity measure between documents indicates how identical two 
documents are. Generally, similarity measures are bounded in the range 
[-1,1] or [0,1] where a similarity score of 1 indicates maximum similarity. 

Types of Similarity 

To measure similarity, documents are realized as a vector of terms 
excluding the stop words. Let's assume that A and B are vectors 
representing two documents. In this case, the different similarity measures 
are shown here: 

• Dice 

The Dice coefficient is denoted by the following: 

$$\mathrm{Sim}(q, d_j) = D(A, B) = \frac{|A \cap B|}{\alpha |A| + (1 - \alpha) |B|}$$ 

Also, 

$\alpha \in [0, 1]$, and let $\alpha = \frac{1}{2}$. 

• Overlap 

The Overlap coefficient is computed as follows: 

$$\mathrm{Sim}(q, d_j) = O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$ 

A variant of the Overlap coefficient is calculated using the max 
operator instead of min. 

• Jaccard 

The Jaccard coefficient is given by the following: 

$$\mathrm{Sim}(q, d_j) = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ 

The Jaccard measure signifies the degree of relevance. 

• Cosine 

The cosine of the angle between two vectors is given by 
the following: 

$$\mathrm{Sim}(q, d_j) = C(A, B) = \frac{|A \cap B|}{\sqrt{|A| \, |B|}}$$ 
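As a small sketch of these four measures, the following code treats each document as a set of terms (stop words already removed). The two example term sets are invented for illustration, and both sets are assumed to be non-empty.

```python
import math


def dice(a, b, alpha=0.5):
    return len(a & b) / (alpha * len(a) + (1 - alpha) * len(b))


def overlap(a, b):
    return len(a & b) / min(len(a), len(b))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def cosine(a, b):
    # Set form of the cosine: shared terms over the geometric mean of sizes.
    return len(a & b) / math.sqrt(len(a) * len(b))


doc_a = {"data", "science", "python", "cluster"}
doc_b = {"python", "cluster", "graph"}

for name, fn in [("dice", dice), ("overlap", overlap),
                 ("jaccard", jaccard), ("cosine", cosine)]:
    print(name, round(fn(doc_a, doc_b), 4))
```

All four measures share the same numerator, the number of common terms, and differ only in how they normalize by the sizes of the two sets.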



Distance and similarity are two opposite measures. For example, 
numeric data correlation is a similarity measure, and Euclidean distance 
is a distance measure. Generally, the value of a similarity measure is 
bounded (for example, between 0 and 1, or between -1 and 1 for 
correlation), but distance has no such upper boundary. Similarity can be 
negative, but by definition, distance cannot be negative. The clustering 
algorithms have changed little since the beginning of this field, but 
researchers are continuously finding new distance measures for varied 
applications. 


What Is Hierarchical Clustering? 


Hierarchical clustering is an iterative method of clustering data objects. 
There are two types. 

• Agglomerative hierarchical algorithms, or a bottom-up 
approach 

• Divisive hierarchical algorithms, or a top-down 
approach 
Bottom-Up Approach 

The bottom-up clustering method is called agglomerative hierarchical 
clustering. In this approach, each input object is initially considered as a 
separate cluster. In each iteration, the algorithm merges the two most 
similar clusters into a single cluster. The operation is continued until all 
the clusters merge into a single cluster. The complexity of the algorithm is 
O(n^3). 

In the algorithm, a set of input objects, I = {I1, I2, ..., In}, is given. The 
output O is a set of ordered triples <D, K, S>, where D is the threshold 
distance, K is the number of clusters, and S is the set of clusters. 

Some variations of the algorithm might allow multiple clusters with 
identical distances to be merged into a single iteration. 

Algorithm 

Input: I = {I1, I2, ..., In} 
Output: O 

for i = 1 to n do 
    Ci ← {Ii}; 
end for 
D ← 0; 
K ← n; 
S ← {C1, ..., Cn}; 
O ← <D, K, S>; 
repeat 
    Dist ← CalculateMinimumDistance(S); 
    D ← ∞; 
    for i = 1 to K-1 do 
        for j = i+1 to K do 
            if Dist(i, j) < D then 
                D ← Dist(i, j); 
                u ← i; 
                v ← j; 
            end if 
        end for 
    end for 
    K ← K-1; 
    Cnew ← Cu ∪ Cv; 
    S ← S ∪ Cnew - Cu - Cv; 
    O ← O ∪ <D, K, S> 
until K = 1; 

A Python example of hierarchical clustering is given later in the chapter. 

Distance Between Clusters 

In hierarchical clustering, calculating the distance between two clusters is 
a critical step. There are three methods to calculate this. 

• Single linkage method 

• Complete linkage method 

• Average linkage method 

Single Linkage Method 

In the single linkage method, the distance between two clusters is the 
minimum of all distances between pairs of objects in the two clusters. 
Because the distance is the minimum, at least one pair of objects lies 
exactly at the cluster distance, and every other pair is at least that far 
apart. So, the single linkage method may be given as follows: 

$$\mathrm{Dist}(C_i, C_j) = \min_{X \in C_i,\, Y \in C_j} \mathrm{dist}(X, Y)$$ 
Complete Linkage Method 

In the complete linkage method, the distance between two clusters is the 
maximum of all distances between pairs of objects in the two clusters. 
Because the distance is the maximum, all pairwise distances are less than 
or equal to the distance between the two clusters. So, the complete linkage 
method can be given by the following: 

$$\mathrm{Dist}(C_i, C_j) = \max_{X \in C_i,\, Y \in C_j} \mathrm{dist}(X, Y)$$ 


Average Linkage Method 

The average linkage method is a compromise between the previous two 
linkage methods. It avoids the extremes of large or compact clusters. The 
distance between clusters Ci and Cj is defined by the following: 

$$\mathrm{Dist}(C_i, C_j) = \frac{\sum_{X \in C_i} \sum_{Y \in C_j} \mathrm{dist}(X, Y)}{|C_i| \, |C_j|}$$ 

$|C_k|$ is the number of data objects in cluster $C_k$. 

The centroid linkage method is similar to the average linkage method, 
but here the distance between two clusters is actually the distance between 
the centroids. The centroid of cluster Ci is defined as follows: 

$$\bar{X}_i = (c_1, \ldots, c_d), \quad \text{with} \quad c_j = \frac{1}{m} \sum_{k=1}^{m} x_{kj}$$ 

Here $x_{kj}$ is the j-th dimension of the k-th data object in cluster Ci, 
m is the number of data objects in the cluster, and d is the number of 
dimensions. 
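The linkage definitions above can be compared side by side with a short NumPy sketch. Euclidean distance is assumed as the underlying `dist`, and the two small clusters are made up for the example.

```python
import numpy as np


def pairwise_distances(ci, cj):
    """Euclidean distance between every object in ci and every object in cj."""
    ci = np.asarray(ci, dtype=float)
    cj = np.asarray(cj, dtype=float)
    diff = ci[:, None, :] - cj[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def single_linkage(ci, cj):
    return pairwise_distances(ci, cj).min()


def complete_linkage(ci, cj):
    return pairwise_distances(ci, cj).max()


def average_linkage(ci, cj):
    return pairwise_distances(ci, cj).mean()


def centroid_linkage(ci, cj):
    centroid_i = np.asarray(ci, dtype=float).mean(axis=0)
    centroid_j = np.asarray(cj, dtype=float).mean(axis=0)
    return np.linalg.norm(centroid_i - centroid_j)


cluster_a = [[0, 0], [0, 1]]
cluster_b = [[3, 0], [4, 0]]
for fn in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(fn.__name__, round(float(fn(cluster_a, cluster_b)), 4))
```

For any pair of clusters the single linkage value is never larger than the average linkage value, which in turn is never larger than the complete linkage value.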
Top-Down Approach 

The top-down clustering method is also called divisive hierarchical 
clustering. It is the reverse of bottom-up clustering. It starts with a single 
cluster consisting of all input objects. In each iteration, it splits a 
cluster into two parts having the maximum distance. 

Algorithm 

Input: I = {I1, I2, ..., In} 
Output: O 

D ← ∞; 
K ← 1; 
S ← {{I1, I2, ..., In}}; 
O ← <D, K, S>; 
repeat 
    X ← cluster in S containing the two data objects with the longest 
        distance dist; 
    Y ← ∅; 
    S ← S - X; 
    Xi ← data object in X with maximum D(Xi, X); 
    X ← X - {Xi}; 
    Y ← Y ∪ {Xi}; 
    repeat 
        for all data objects Xj in X do 
            e(j) ← D(Xj, X) - D(Xj, Y); 
        end for 
        if ∃ e(j) > 0 then 
            Xk ← data object in X with maximum e(j); 
            X ← X - {Xk}; 
            Y ← Y ∪ {Xk}; 
            split ← TRUE; 
        else 
            split ← FALSE; 
        end if 
    until split == FALSE; 
    D ← dist; 
    K ← K+1; 
    S ← S ∪ X ∪ Y; 
    O ← O ∪ <D, K, S>; 
until K = n; 

A dendrogram O is the output of any hierarchical clustering. Figure 4-2 
illustrates a dendrogram. 

Figure 4-2. A dendrogram 

To create clusters from a dendrogram, you need a threshold of 
distance or similarity. An easy way to choose one is to plot the distribution 
of distance or similarity and find the inflection point of the curve. For 
Gaussian distribution data, the inflection point is located at x = mean + 
n*std and x = mean - n*std, as shown in Figure 4-3. 



Figure 4-3. The inflection point 


The following code creates a hierarchical cluster using Python: 

import numpy as np 

class cluster_node: 
    def __init__(self, vec1, left1=None, right1=None, distance1=0.0, 
                 id1=None, count1=1): 
        self.left1 = left1 
        self.right1 = right1 
        self.vec1 = vec1 
        self.id1 = id1 
        self.distance1 = distance1 
        self.count1 = count1  # only used for weighted average 

def L2dist(v1, v2): 
    # Euclidean distance between two vectors 
    return np.sqrt(np.sum((v1 - v2) ** 2)) 

def hcluster(features1, distance1=L2dist): 
    # cluster the rows of the "features" matrix 
    distances1 = {} 
    currentclustid1 = -1 

    # clusters are initially just the individual rows 
    clust1 = [cluster_node(np.array(features1[i1]), id1=i1) 
              for i1 in range(len(features1))] 

    while len(clust1) > 1: 
        lowestpair1 = (0, 1) 
        closest1 = distance1(clust1[0].vec1, clust1[1].vec1) 

        # loop through every pair looking for the smallest distance 
        for i1 in range(len(clust1)): 
            for j1 in range(i1 + 1, len(clust1)): 
                # distances1 is the cache of distance calculations 
                key1 = (clust1[i1].id1, clust1[j1].id1) 
                if key1 not in distances1: 
                    distances1[key1] = distance1(clust1[i1].vec1, 
                                                 clust1[j1].vec1) 
                d1 = distances1[key1] 
                if d1 < closest1: 
                    closest1 = d1 
                    lowestpair1 = (i1, j1) 

        # calculate the average of the two clusters 
        mergevec1 = [(clust1[lowestpair1[0]].vec1[i1] 
                      + clust1[lowestpair1[1]].vec1[i1]) / 2.0 
                     for i1 in range(len(clust1[0].vec1))] 

        # create the new cluster 
        newcluster1 = cluster_node(np.array(mergevec1), 
                                   left1=clust1[lowestpair1[0]], 
                                   right1=clust1[lowestpair1[1]], 
                                   distance1=closest1, 
                                   id1=currentclustid1) 

        # cluster ids that weren't in the original set are negative 
        currentclustid1 -= 1 
        # delete the higher index first so the lower index stays valid 
        del clust1[lowestpair1[1]] 
        del clust1[lowestpair1[0]] 
        clust1.append(newcluster1) 

    return clust1[0] 

The previous code will create the dendrogram. Creation of the clusters 
from that dendrogram using some threshold distance is given by the 
following: 

def extract_clusters(clust1, dist1): 
    # extract list of sub-tree clusters from the h-cluster tree 
    # with distance < dist1 
    if clust1.distance1 < dist1: 
        # we have found a cluster subtree 
        return [clust1] 
    else: 
        # check the right and left branches 
        cl1 = [] 
        cr1 = [] 
        if clust1.left1 is not None: 
            cl1 = extract_clusters(clust1.left1, dist1=dist1) 
        if clust1.right1 is not None: 
            cr1 = extract_clusters(clust1.right1, dist1=dist1) 
        return cl1 + cr1 
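For comparison, the same two steps (build the dendrogram, then cut it at a threshold distance) can be sketched with SciPy's `scipy.cluster.hierarchy` module. This is an alternative to the hand-written code above, not the method used in this chapter; the sample points and the threshold of 1.0 are assumptions chosen so that two groups are clearly separated.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated groups of 2-D points (illustrative data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# Build the merge tree (dendrogram) with average linkage.
Z = linkage(points, method="average", metric="euclidean")

# Cut the tree: merges farther apart than the threshold are not joined.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

Each row of `Z` records one merge, so a data set of n points produces n - 1 rows.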
Graph Theoretical Approach 

The clustering problem can be mapped to a graph, where every node in 
the graph is an input data point. If the distance between two nodes is 
less than the threshold, then the corresponding nodes are connected. 
Now using a graph partition algorithm, you can cluster the graph. One 
industry example of clustering is in investment banking, where 
instruments are clustered according to the correlation of their price time 
series, and the instruments of each cluster are traded together. This is 
known as basket trading in algorithmic trading. So, by using the similarity 
measure, you can construct the graph where the nodes are instruments 
and the edges between the nodes indicate that the instruments are 
correlated. To create the basket, you need a set of instruments where all 
are correlated to each other. In a graph, this is a set of nodes, or a 
subgraph, where all the nodes in the subgraph are connected to each other. 
This kind of subgraph is known as a clique. Finding the clique of maximum 
size is an NP-complete problem. People use heuristic solutions to solve 
this problem of clustering. 
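A minimal sketch of this idea follows: build a graph from a correlation matrix and enumerate its maximal cliques with the basic Bron-Kerbosch recursion. The instrument names, correlation values, and the 0.7 threshold are invented for illustration, and exhaustive enumeration like this is practical only for small graphs.

```python
def build_graph(corr, names, threshold):
    """Connect two instruments when their correlation reaches the threshold."""
    graph = {name: set() for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i][j] >= threshold:
                graph[names[i]].add(names[j])
                graph[names[j]].add(names[i])
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques (Bron-Kerbosch without pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


names = ["A", "B", "C", "D"]
corr = [[1.00, 0.90, 0.85, 0.10],
        [0.90, 1.00, 0.80, 0.20],
        [0.85, 0.80, 1.00, 0.75],
        [0.10, 0.20, 0.75, 1.00]]

graph = build_graph(corr, names, threshold=0.7)
cliques = maximal_cliques(graph)
basket = max(cliques, key=len)
print(sorted(basket))
```

Here the largest clique plays the role of the basket: every instrument in it is correlated with every other one.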

How Do You Know If the Clustering Result Is 
Good? 

After applying the clustering algorithm, verifying the result as good or bad 
is a crucial step in cluster analysis. Three parameters are used to measure 
the quality of a cluster, namely, centroid, radius, and diameter. For a 
cluster m with N member points $t_{m1}, \ldots, t_{mN}$: 

$$\text{Centroid} = C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$$ 

$$\text{Radius} = R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$$ 

$$\text{Diameter} = D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{(N)(N-1)}}$$ 

If you consider the cluster as a circle in a space surrounding all 
member points in that cluster, then you can take the centroid as the center 
of the circle. Similarly, the radius and diameter of the cluster are the radius 
and diameter of the circle. Any cluster can be represented by using these 
three parameters. One measure of good clustering is that the distance 
between the centers of two clusters should be greater than the sum of 
their radii. 
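The three quality parameters can be computed directly with NumPy. The sketch below assumes the usual root-mean-square definitions of radius and diameter; the function names and sample points are chosen here for illustration.

```python
import numpy as np


def cluster_stats(points):
    """Return (centroid, radius, diameter) of one cluster."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    centroid = pts.mean(axis=0)
    # Root-mean-square distance of the members from the centroid.
    radius = np.sqrt(((pts - centroid) ** 2).sum() / n)
    # Root-mean-square distance over all ordered pairs of distinct members.
    diff = pts[:, None, :] - pts[None, :, :]
    diameter = np.sqrt((diff ** 2).sum() / (n * (n - 1)))
    return centroid, radius, diameter


def well_separated(points_a, points_b):
    """Check that the centers are farther apart than the sum of the radii."""
    ca, ra, _ = cluster_stats(points_a)
    cb, rb, _ = cluster_stats(points_b)
    return np.linalg.norm(ca - cb) > ra + rb


square = [[0, 0], [2, 0], [0, 2], [2, 2]]
shifted = [[10, 10], [12, 10], [10, 12], [12, 12]]

centroid, radius, diameter = cluster_stats(square)
print(centroid, radius, diameter)
print(well_separated(square, shifted))
```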

General measures of the goodness of a machine learning algorithm 
are precision and recall. If A denotes the set of retrieved results, B denotes 
the set of relevant results, P denotes the precision, and R denotes the 
recall, then: 

$$P(A, B) = \frac{|A \cap B|}{|A|}$$ 

$$R(A, B) = \frac{|A \cap B|}{|B|}$$ 
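As a quick worked example of these two ratios, the following sketch computes precision and recall for two small sets. The item identifiers are arbitrary, and both sets are assumed to be non-empty.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved and a set of relevant items."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return precision, recall


retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Two of the four retrieved items are relevant, so precision is 0.5, and two of the five relevant items were retrieved, so recall is 0.4.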
CHAPTER 5 


Deep Learning 
and Neural Networks 

Neural networks, specifically known as artificial neural networks 
(ANNs), were developed by the inventor of one of the first neurocomputers, 
Dr. Robert Hecht-Nielsen. He defines a neural network as follows: 

"...a computing system made up of a number of simple, highly 
interconnected processing elements, which process information by their 
dynamic state response to external inputs." 

Customarily, neural networks are arranged in multiple layers. The 
layers consist of several interconnected nodes containing an activation 
function. The input layer presents the patterns to the network and 
communicates with the hidden layers. The hidden layers are linked to an 
output layer. 

Neural networks have many uses. As an example, in passenger load 
prediction in the airline domain, the passenger load in month t depends 
more heavily on the data from month t-12 than on the data from months 
t-1 or t-2. Hence, a neural network normally produces a better result than 
a time-series model. Image classification is another common use. In a 
chatbot dialogue system, the memory network, which is actually a neural 
network over a bag of words of the previous conversation, is a popular 
approach. There are many ways to realize a neural network. In this book, 
I will focus only on the backpropagation algorithm because it is the most 
popular. 


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python, 
https://doi.org/10.1007/978-1-4842-3450-1_5 
Backpropagation 

Backpropagation, which is usually combined with an optimization 
method like gradient descent, is a common method of training artificial 
neural networks. The method computes the error at the output layer, 
propagates it backward toward the input layer, and then updates the 
weights as a function of that error, the input, and the learning rate. The 
goal is to minimize the error as far as possible. 

Backpropagation Approach 

Problems like the noisy image to ASCII example are challenging to solve 
with a computer because of the basic incompatibility between the 
machine and the problem. Nowadays, computer systems are customized 
to perform mathematical and logical functions at speeds that are beyond 
the capability of humans. Even the relatively unsophisticated desktop 
microcomputers, widely prevalent currently, can perform a massive 
number of numeric comparisons or combinations every second. 

The problem lies in the inherently sequential nature of the computer. 
The "fetch-execute" cycle of the von Neumann architecture allows the 
machine to perform only one function at a time. In most cases, the time 
required by the computer to perform each instruction is so short that the 
average time required for even a large program is negligible to users. 

A processing system that can evaluate all the pixels in the image in 
parallel is referred to as the backpropagation network (BPN). 

Generalized Delta Rule 

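I will now introduce the backpropagation learning procedure for 
learning internal representations. A neural network is termed a 
mapping network if it possesses the ability to compute certain functional 
relationships between its input and output. 

Suppose a set of P vector pairs, $(x_1, y_1), (x_2, y_2), \ldots, (x_P, y_P)$, which are 
examples of the functional mapping $y = \phi(x)$, with $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^M$. 

The equations for information processing in the three-layer network 
are shown next. For an input vector $x_p = (x_{p1}, \ldots, x_{pN})^t$, the net input to 
the j-th hidden node is as follows, where $\theta$ denotes a bias term: 

$$\mathrm{net}^{h}_{pj} = \sum_{i=1}^{N} w^{h}_{ji} x_{pi} + \theta^{h}_{j}$$ 

The output of this hidden node is as follows: 

$$i_{pj} = f^{h}_{j}(\mathrm{net}^{h}_{pj})$$ 

The equations for output nodes are as follows, where L is the number of 
hidden nodes: 

$$\mathrm{net}^{o}_{pk} = \sum_{j=1}^{L} w^{o}_{kj} i_{pj} + \theta^{o}_{k}$$ 

$$o_{pk} = f^{o}_{k}(\mathrm{net}^{o}_{pk})$$ 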
Update of Output Layer Weights 

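The error for pattern p is $E_p = \frac{1}{2} \sum_{k} (y_{pk} - o_{pk})^2$. The following 
equation is the gradient of this error term with respect to an output layer 
weight: 

$$\frac{\partial E_p}{\partial w^{o}_{kj}} = -(y_{pk} - o_{pk}) \frac{\partial f^{o}_{k}}{\partial (\mathrm{net}^{o}_{pk})} \frac{\partial (\mathrm{net}^{o}_{pk})}{\partial w^{o}_{kj}}$$ 

The last factor in the equation of the output layer weight is as follows: 

$$\frac{\partial (\mathrm{net}^{o}_{pk})}{\partial w^{o}_{kj}} = \frac{\partial}{\partial w^{o}_{kj}} \left( \sum_{j=1}^{L} w^{o}_{kj} i_{pj} + \theta^{o}_{k} \right) = i_{pj}$$ 

The negative gradient is as follows: 

$$-\frac{\partial E_p}{\partial w^{o}_{kj}} = (y_{pk} - o_{pk}) f^{o\prime}_{k}(\mathrm{net}^{o}_{pk}) \, i_{pj}$$ 

The weights on the output layer are updated according to the following: 

$$w^{o}_{kj}(t + 1) = w^{o}_{kj}(t) + \Delta_p w^{o}_{kj}(t)$$ 

$$\Delta_p w^{o}_{kj} = \eta (y_{pk} - o_{pk}) f^{o\prime}_{k}(\mathrm{net}^{o}_{pk}) \, i_{pj}$$ 

There are two forms of the output function. 

• $f^{o}_{k}(\mathrm{net}^{o}_{pk}) = \mathrm{net}^{o}_{pk}$ 

• $f^{o}_{k}(\mathrm{net}^{o}_{pk}) = (1 + e^{-\mathrm{net}^{o}_{pk}})^{-1}$ 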

Update of Hidden Layer Weights 

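Look closely at the network illustrated previously consisting of one layer 
of hidden neurons and one output neuron. When an input vector is 
propagated through the network, for the current set of weights there is an 
output prediction. Intuitively, the total error must relate to the 
output values of the hidden layer. The equation is as follows: 

$$E_p = \frac{1}{2} \sum_{k} (y_{pk} - o_{pk})^2 = \frac{1}{2} \sum_{k} \left( y_{pk} - f^{o}_{k}(\mathrm{net}^{o}_{pk}) \right)^2 = \frac{1}{2} \sum_{k} \left( y_{pk} - f^{o}_{k}\Big( \sum_{j} w^{o}_{kj} i_{pj} + \theta^{o}_{k} \Big) \right)^2$$ 

You can exploit this fact to calculate the gradient of $E_p$ with respect to 
the hidden layer weights. 

$$\frac{\partial E_p}{\partial w^{h}_{ji}} = -\sum_{k} (y_{pk} - o_{pk}) \frac{\partial o_{pk}}{\partial (\mathrm{net}^{o}_{pk})} \frac{\partial (\mathrm{net}^{o}_{pk})}{\partial i_{pj}} \frac{\partial i_{pj}}{\partial (\mathrm{net}^{h}_{pj})} \frac{\partial (\mathrm{net}^{h}_{pj})}{\partial w^{h}_{ji}}$$ 

Each of the factors in the equation can be calculated explicitly from the 
previous equations. The result is as follows: 

$$\frac{\partial E_p}{\partial w^{h}_{ji}} = -\sum_{k} (y_{pk} - o_{pk}) f^{o\prime}_{k}(\mathrm{net}^{o}_{pk}) \, w^{o}_{kj} \, f^{h\prime}_{j}(\mathrm{net}^{h}_{pj}) \, x_{pi}$$ 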
BPN Summary 

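Apply the input vector $x_p = (x_{p1}, \ldots, x_{pN})^t$ to the input units. 

Calculate the net input values to the hidden layer units. 

$$\mathrm{net}^{h}_{pj} = \sum_{i=1}^{N} w^{h}_{ji} x_{pi} + \theta^{h}_{j}$$ 

Calculate the outputs from the hidden layer. 

$$i_{pj} = f^{h}_{j}(\mathrm{net}^{h}_{pj})$$ 

Calculate the net input values to each unit of the output layer. 

$$\mathrm{net}^{o}_{pk} = \sum_{j=1}^{L} w^{o}_{kj} i_{pj} + \theta^{o}_{k}$$ 

Calculate the outputs. 

$$o_{pk} = f^{o}_{k}(\mathrm{net}^{o}_{pk})$$ 

Calculate the error terms for the output units. 

$$\delta^{o}_{pk} = (y_{pk} - o_{pk}) f^{o\prime}_{k}(\mathrm{net}^{o}_{pk})$$ 

Calculate the error terms for the hidden units. 

$$\delta^{h}_{pj} = f^{h\prime}_{j}(\mathrm{net}^{h}_{pj}) \sum_{k} \delta^{o}_{pk} w^{o}_{kj}$$ 

Update weights on the output layer. 

$$w^{o}_{kj}(t + 1) = w^{o}_{kj}(t) + \eta \, \delta^{o}_{pk} \, i_{pj}$$ 

Update weights on the hidden layer. 

$$w^{h}_{ji}(t + 1) = w^{h}_{ji}(t) + \eta \, \delta^{h}_{pj} \, x_{pi}$$ 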
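The steps of the summary can be traced in a few lines of NumPy. This is a sketch for a single training pattern and a three-layer network with sigmoid units in both layers; the layer sizes, the learning rate, the random starting weights, and the choice to update the bias terms alongside the weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bpn_step(x, y, w_h, theta_h, w_o, theta_o, eta=0.5):
    """One forward and backward pass for a single pattern (x, y)."""
    # Forward pass: hidden layer, then output layer.
    net_h = w_h @ x + theta_h
    i_h = sigmoid(net_h)
    net_o = w_o @ i_h + theta_o
    o = sigmoid(net_o)
    # Error terms; the sigmoid derivative is f * (1 - f).
    delta_o = (y - o) * o * (1.0 - o)
    delta_h = i_h * (1.0 - i_h) * (w_o.T @ delta_o)
    # Weight updates (biases updated the same way, as an assumption).
    new_w_o = w_o + eta * np.outer(delta_o, i_h)
    new_w_h = w_h + eta * np.outer(delta_h, x)
    new_theta_o = theta_o + eta * delta_o
    new_theta_h = theta_h + eta * delta_h
    error = 0.5 * np.sum((y - o) ** 2)
    return new_w_h, new_theta_h, new_w_o, new_theta_o, error


rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.1])
y = np.array([1.0, 0.0])
w_h = rng.normal(scale=0.5, size=(4, 3))
theta_h = np.zeros(4)
w_o = rng.normal(scale=0.5, size=(2, 4))
theta_o = np.zeros(2)

errors = []
for _ in range(200):
    w_h, theta_h, w_o, theta_o, err = bpn_step(x, y, w_h, theta_h, w_o, theta_o)
    errors.append(err)

print(errors[0], errors[-1])
```

The hidden-layer error terms are computed with the output weights as they were before the update, which matches the order of the steps in the summary.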


Backpropagation Algorithm 

Let’s see some code: 

import numpy as np 

class NeuralNetwork(object): 
    # Excerpt: self.biases, self.weights, and self.num_layers are assumed 
    # to be set up in a constructor that is not shown here. 

    def backpropagate(self, x1, y1): 
        """Return a tuple "(nabla_b, nabla_w)" representing the 
        gradient for the cost function C_x. "nabla_b" and 
        "nabla_w" are layer-by-layer lists of numpy arrays, similar 
        to "self.biases" and "self.weights".""" 
        nabla_b1 = [np.zeros(b1.shape) for b1 in self.biases] 
        nabla_w1 = [np.zeros(w1.shape) for w1 in self.weights] 

        # feedforward 
        activation1 = x1 
        activations1 = [x1]  # activations, layer by layer 
        zs1 = []             # weighted inputs, layer by layer 
        for b1, w1 in zip(self.biases, self.weights): 
            z1 = np.dot(w1, activation1) + b1 
            zs1.append(z1) 
            activation1 = sigmoid(z1) 
            activations1.append(activation1) 

        # backward pass 
        delta1 = self.cost_derivative(activations1[-1], y1) * \ 
            sigmoid_prime(zs1[-1]) 
        nabla_b1[-1] = delta1 
        nabla_w1[-1] = np.dot(delta1, activations1[-2].transpose()) 
        # l = 2 is the second-last layer, l = 3 the third-last, and so on 
        for l in range(2, self.num_layers): 
            z1 = zs1[-l] 
            sp1 = sigmoid_prime(z1) 
            delta1 = np.dot(self.weights[-l + 1].transpose(), delta1) * sp1 
            nabla_b1[-l] = delta1 
            nabla_w1[-l] = np.dot(delta1, activations1[-l - 1].transpose()) 
        return (nabla_b1, nabla_w1) 

    def cost_derivative(self, output_activations1, y1): 
        """Return the vector of partial derivatives \\partial C_x / 
        \\partial a for the output activations.""" 
        return (output_activations1 - y1) 

def sigmoid(z1): 
    """The sigmoid function.""" 
    return 1.0 / (1.0 + np.exp(-z1)) 

def sigmoid_prime(z1): 
    """Derivative of the sigmoid function.""" 
    return sigmoid(z1) * (1 - sigmoid(z1)) 


Other Algorithms 

Many techniques are available to train neural networks besides 
backpropagation. One of the methods is to use common optimization 
algorithms such as gradient descent, the Adam optimizer, and so on. The 
simple perceptron method is also frequently applied. Hebb's postulate 
is another popular method. In Hebb's learning, instead of the error, the 
product of the input and output goes as the feedback to correct the weight. 

$$w_{ij}(t + 1) = w_{ij}(t) + \eta \, y_j(t) \, x_i(t)$$ 
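Hebb's rule is short enough to state directly in NumPy. In this sketch the weight matrix is indexed as w[i, j] for input i and output j, matching the subscripts in the formula; the learning rate and the example vectors are assumptions.

```python
import numpy as np


def hebb_update(w, x, y, eta=0.1):
    """Strengthen w[i, j] in proportion to input x[i] times output y[j]."""
    return w + eta * np.outer(x, y)


w0 = np.zeros((3, 2))
x_in = np.array([1.0, 0.0, 1.0])
y_out = np.array([1.0, -1.0])

w1 = hebb_update(w0, x_in, y_out)
print(w1)
```

Weights attached to an inactive input (here the second one) stay unchanged, because the product of input and output is zero.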


TensorFlow 


TensorFlow is a popular deep learning library in Python. It is a Python 
wrapper on the original library. It supports parallelism on the CUDA- 
based GPU platform. The following code is an example of simple linear 
regression with TensorFlow: 

import tensorflow as tf  # this listing uses the TensorFlow 1.x API 

# X_train and y_train are assumed to be NumPy arrays prepared earlier 
learning_rate = 0.0001 
y_t = tf.placeholder("float", [None, 1]) 
x_t = tf.placeholder("float", [None, X_train.shape[1]]) 

W = tf.Variable(tf.random_normal([X_train.shape[1], 1], stddev=.01)) 
b = tf.constant(1.0) 

model = tf.matmul(x_t, W) + b 
cost_function = tf.reduce_sum(tf.pow((y_t - model), 2)) 
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_function) 

init = tf.global_variables_initializer() 

with tf.Session() as sess: 
    sess.run(init) 
    w = W.eval(session=sess) 
    of = b.eval(session=sess) 
    print("Before Training #################################################") 
    print(w, of) 
    print("#################################################################") 
    step = 0 
    previous = 0 
    feed = {x_t: X_train.reshape(X_train.shape[0], X_train.shape[1]), 
            y_t: y_train.reshape(y_train.shape[0], 1)} 
    while(1): 
        step = step + 1 
        sess.run(optimizer, feed_dict=feed) 
        cost = sess.run(cost_function, feed_dict=feed) 
        if step % 1000 == 0: 
            print(cost) 
            # stop when the cost has stopped changing; abs() avoids an 
            # immediate exit while previous is still 0 
            if abs(previous - cost) < .0001: 
                break 
            previous = cost 
    w = W.eval(session=sess) 
    of = b.eval(session=sess) 
    print("After Training ##################################################") 
    print(w, of) 
    print("#################################################################") 

With a little change, you can make it multilayer linear regression, as 
shown here: 

import numpy as np 
import tensorflow as tf  # this listing uses the TensorFlow 1.x API 

# multilayer, X_train, X_train_user, X_train_context, and y_train are 
# assumed to be defined earlier 
learning_rate = 0.0001 
y_t = tf.placeholder("float", [None, 1]) 
if not multilayer: 
    x_t = tf.placeholder("float", [None, X_train.shape[1]]) 
    W = tf.Variable(tf.random_normal([X_train.shape[1], 1], stddev=.01)) 
    b = tf.constant(0.0) 
    model = tf.matmul(x_t, W) + b 
else: 
    x_t_user = tf.placeholder("float", [None, X_train_user.shape[1]]) 
    x_t_context = tf.placeholder("float", [None, X_train_context.shape[1]]) 
    W_user = tf.Variable(tf.random_normal( 
        [X_train_user.shape[1], 1], stddev=.01)) 
    W_context = tf.Variable(tf.random_normal( 
        [X_train_context.shape[1], 1], stddev=.01)) 
    W_out_user = tf.Variable(tf.random_normal([1, 1], stddev=.01)) 
    W_out_context = tf.Variable(tf.random_normal([1, 1], stddev=.01)) 
    model = tf.add( 
        tf.matmul(tf.matmul(x_t_user, W_user), W_out_user), 
        tf.matmul(tf.matmul(x_t_context, W_context), W_out_context)) 

cost_function = tf.reduce_sum(tf.pow((y_t - model), 2)) 
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_function) 

init = tf.global_variables_initializer() 

with tf.Session() as sess: 
    sess.run(init) 
    print("Before Training #################################################") 
    step = 0 
    previous = 0 
    cost = 0 
    # the feed dictionary does not change between steps, so build it once 
    if not multilayer: 
        feed = {x_t: X_train.reshape(X_train.shape[0], X_train.shape[1]), 
                y_t: y_train.reshape(y_train.shape[0], 1)} 
    else: 
        feed = {x_t_user: X_train_user.reshape( 
                    X_train_user.shape[0], X_train_user.shape[1]), 
                x_t_context: X_train_context.reshape( 
                    X_train_context.shape[0], X_train_context.shape[1]), 
                y_t: y_train.reshape(y_train.shape[0], 1)} 
    while(1): 
        step = step + 1 
        sess.run(optimizer, feed_dict=feed) 
        cost = sess.run(cost_function, feed_dict=feed) 
        if step % 1000 == 0: 
            print(cost) 
            if previous == cost or step > 50000: 
                break 
            if cost != cost:  # NaN is the only value not equal to itself 
                raise Exception("NaN value") 
            previous = cost 
    print("#################################################################") 

    if multilayer: 
        w_user = W_user.eval(session=sess) 
        w_context = W_context.eval(session=sess) 
        w_out_context = W_out_context.eval(session=sess) 
        w_out_user = W_out_user.eval(session=sess) 
        # collapse the two linear layers into one effective weight vector 
        w_user = np.dot(w_user, w_out_user) 
        w_context = np.dot(w_context, w_out_context) 
    else: 
        w = W.eval(session=sess) 

You can do logistic regression with the same code with a little change, 
as shown here: 

import tensorflow as tf  # this listing uses the TensorFlow 1.x API 

# X_train and y_train (one-hot encoded labels) are assumed to be 
# NumPy arrays prepared earlier 
learning_rate = 0.001 
no_of_level = 2 

y_t = tf.placeholder("float", [None, no_of_level]) 
x_t = tf.placeholder("float", [None, X_train.shape[1]]) 
W = tf.Variable(tf.random_normal( 
    [X_train.shape[1], no_of_level], stddev=.01)) 
model = tf.nn.softmax(tf.matmul(x_t, W)) 

# cross-entropy between the one-hot labels and the softmax output 
cost_function = tf.reduce_mean( 
    -tf.reduce_sum(y_t * tf.log(model), reduction_indices=1)) 
optimizer = tf.train.GradientDescentOptimizer( 
    learning_rate).minimize(cost_function) 

init = tf.global_variables_initializer() 

with tf.Session() as sess: 
    sess.run(init) 
    print("Before Training #################################################") 
    step = 0 
    previous = 0 
    cost = 0 
    feed = {x_t: X_train.reshape(X_train.shape[0], X_train.shape[1]), 
            y_t: y_train.reshape(y_train.shape[0], no_of_level)} 
    while(1): 
        step = step + 1 
        sess.run(optimizer, feed_dict=feed) 
        cost = sess.run(cost_function, feed_dict=feed) 
        if step % 1000 == 0: 
            print(cost) 
            if previous == cost or step > 50000: 
                break 
            if cost != cost:  # NaN is the only value not equal to itself 
                raise Exception("NaN value") 
            previous = cost 
    print("#################################################################") 
    w = W.eval(session=sess) 
Recurrent Neural Network 


A recurrent neural network is an extremely popular kind of network 
where the output of the previous step is fed back as an input to the 
hidden layer. It is an extremely useful solution for problems like 
sequence labeling or time-series prediction. One of the more popular 
applications of sequence labeling is the autocomplete feature of a search 
engine. 

As an example, say one algorithmic trader wants to predict the price 
of a stock for trading. But his strategy requires the following criteria for 
prediction: 

a) The predicted tick is higher than the current tick and 
the next tick. Win. 

b) The predicted tick is lower than the current tick and 
the next tick. Win. 

c) The predicted tick is higher than the current tick but 
lower than the next tick. Loss. 

d) The predicted tick is lower than the current tick but 
higher than the next tick. Loss. 

To satisfy his criteria, the developer takes the following strategy. 

For generating predictions for 100 records, he is considering preceding 
1,000 records as input 1, prediction errors in the last 1,000 records as input 
2, and differences between two consecutive records as input 3. Using these 
inputs, an RNN-based engine predicts results, errors, and inter-record 
differences for the next 100 records. 

Then he takes the following strategy: 

If predicted diff > 1 and predicted err < 1, then 
prediction += pred_err + 1. 

If predicted diff < 1 and predicted err > 1, then 
prediction -= pred_err - 1. 

In this way, the prediction satisfies the developer's need. The detailed 
code is shown next. It uses Keras, which is a wrapper on top of TensorFlow. 

import matplotlib.pyplot as plt 
import numpy as np 
import time 
import csv 

from keras.layers.core import Dense, Activation, Dropout 
from keras.layers.recurrent import LSTM 
from keras.models import Sequential 
import sys 

np.random.seed(1234) 

def read_data(path_to_dataset, 
              sequence_length=50, 
              ratio=1.0): 
    # 2049280.0 is the total number of valid values, i.e. ratio = 1.0 
    max_values = ratio * 2049280 

    with open(path_to_dataset) as f: 
        data = csv.reader(f, delimiter=",") 
        power = [] 
        nb_of_values = 0 
        for line in data: 
            try: 
                # the value of interest is in the second column 
                power.append(float(line[1])) 
                nb_of_values += 1 
            except ValueError: 
                pass 
            if nb_of_values >= max_values: 
                break 
    return power 

def process_data(power, sequence_length, ratio, error): 
    global result_mean, result_std 

    # build overlapping windows of sequence_length consecutive values 
    result = [] 
    for index in range(len(power) - sequence_length): 
        result.append(power[index: index + sequence_length]) 
    result = np.array(result)  # shape (number of windows, sequence_length) 

    if not error: 
        # normalize the raw series; error and diff series are used as is 
        result_mean = result.mean() 
        result_std = result.std() 
        result -= result_mean 
        result /= result_std 

    print("Data : ", result.shape) 

    # first 90 percent of the windows for training, the rest for testing 
    row = int(round(0.9 * result.shape[0])) 
    print(row) 
    train = result[:row, :] 
    np.random.shuffle(train) 
    # the last value of each window is the target 
    X_train = train[:, :-1] 
    y_train = train[:, -1] 
    X_test = result[row:, :-1] 
    y_test = result[row:, -1] 

    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1)) 
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1)) 

    return [X_train, y_train, X_test, y_test] 

def build_model(): 
    model = Sequential() 
    layers = [1, 50, 100, 1] 

    model.add(LSTM( 
        layers[1], 
        input_shape=(None, layers[0]), 
        return_sequences=True)) 
    model.add(Dropout(0.2)) 

    model.add(LSTM( 
        layers[2], 
        return_sequences=False)) 
    model.add(Dropout(0.2)) 

    model.add(Dense( 
        layers[3])) 
    model.add(Activation("linear")) 

    start = time.time() 
    model.compile(loss="mse", optimizer="rmsprop") 
    print("Compilation Time : ", time.time() - start) 
    return model 
def run_network(model=None, data=None, error=False): 
    global_start_time = time.time() 
    epochs = 2 
    ratio = 0.5 
    sequence_length = 100 

    X_train, y_train, X_test, y_test = process_data( 
        data, sequence_length, ratio, error) 

    print('\nData Loaded. Compiling...\n') 

    if model is None: 
        model = build_model() 

    try: 
        # nb_epoch is the Keras 1.x name of the epochs argument 
        model.fit( 
            X_train, y_train, 
            batch_size=512, nb_epoch=epochs, validation_split=0.05) 
        predicted = model.predict(X_test) 
        predicted = np.reshape(predicted, (predicted.size,)) 
    except KeyboardInterrupt: 
        print('Training duration (s) : ', time.time() - global_start_time) 
        return model, y_test, 0 

    try: 
        # result_max is not defined in the original listing, so the 
        # values are plotted here without rescaling 
        fig = plt.figure() 
        ax = fig.add_subplot(111) 
        ax.plot(y_test[:100]) 
        plt.plot(predicted[:100]) 
        plt.show() 
    except Exception as e: 
        print(str(e)) 
    print('Training duration (s) : ', time.time() - global_start_time) 

    return model, y_test, predicted 

if __name__ == '__main__': 
    path_to_dataset = '20170301_ltp.csv' 
    data = read_data(path_to_dataset) 
    error = [] 
    diff_predicted = [] 
    err_predicted = [] 
    print(len(data)) 

    for i in range(0, len(data) - 1000, 89): 
        d = data[i:i + 1000] 
        model, y_test, predicted = run_network(None, d, False) 
        if i > 11 and len(error) >= 1000: 
            # second model: predict the relative prediction error 
            model, err_test, err_predicted = run_network(None, error, True) 
            error = error[90:] 
            # third model: predict the difference between consecutive ticks 
            d1 = data[i:i + 1001] 
            diff = [0] * 1000 
            for k in range(1000): 
                diff[k] = d1[k + 1] - d1[k] 
            model, diff_test, diff_predicted = run_network(None, diff, True) 
        print(i, len(d), len(y_test)) 
        # undo the normalization applied in process_data 
        y_test *= result_std 
        predicted *= result_std 
        y_test += result_mean 
        predicted += result_mean 
        e = (y_test - predicted) / predicted 
        error = np.concatenate([error, e]) 
        if i > 11 and len(error) >= 1000 and len(err_predicted) >= 90: 
            for j in range(len(y_test) - 1): 
                if diff_predicted[j] > 1 and \ 
                        err_predicted[j] * predicted[j] <= 1: 
                    predicted[j] += abs(err_predicted[j] * predicted[j]) + 1 
                if diff_predicted[j] <= 1 and \ 
                        err_predicted[j] * predicted[j] > 1: 
                    predicted[j] -= abs(err_predicted[j] * predicted[j]) - 1 
                print(y_test[j], ' / ', predicted[j]) 
            print("length of error", len(error)) 
Time Series


A time series is a series of data points arranged chronologically. Most
commonly, the time points are equally spaced. A few examples are the
passenger loads of an airline recorded each month for the past two years
or the price of an instrument in the share market recorded each day for the
last year. The primary aim of time-series analysis is to predict the future
value of a parameter based on its past data.


Classification of Variation 


Traditionally, time-series analysis divides the variation into three major
components, namely, trends, seasonal variations, and other cyclic
changes. The variation that remains is attributed to "irregular" fluctuations
or an error term. This approach is particularly valuable when the variation is
mostly comprised of trends and seasonality.


Analyzing a Series Containing a Trend 

A trend is a change in the mean level that is long-term in nature. For
example, if you have a series like 2, 4, 6, 8 and someone asks you for the
next value, the obvious answer is 10. You can justify your answer by fitting
a line to the data using simple least-squares estimation or any other
regression method. A trend can also be nonlinear. Figure 6-1 shows an
example of a time series with trends.
Figure 6-1. A time series with trends (international airline passengers, in thousands)

The simplest type of time series is the familiar "linear trend plus noise"
for which the observation at time t is a random variable $X_t$, as follows:

$X_t = \alpha + \beta t + \varepsilon_t$

Here, $\alpha$ and $\beta$ are constants, and $\varepsilon_t$ denotes a random error term with
a mean of 0. The average level at time t is given by $m_t = \alpha + \beta t$. This is
sometimes called the trend term. 
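
As a minimal sketch of estimating the trend term, the following fits a straight line by least squares to the toy series 2, 4, 6, 8 mentioned earlier and extrapolates one step. It assumes NumPy is available; the variable names are illustrative only.

```python
import numpy as np

# Toy series from the text: 2, 4, 6, 8 observed at t = 0, 1, 2, 3
t = np.arange(4)
x = np.array([2.0, 4.0, 6.0, 8.0])

# Least-squares fit of a degree-1 polynomial; polyfit returns [slope, intercept]
beta, alpha = np.polyfit(t, x, deg=1)

# Extrapolate the fitted trend one step ahead (t = 4)
next_value = alpha + beta * 4

# Residuals estimate the error term around the fitted trend
residuals = x - (alpha + beta * t)
print(round(next_value, 6))
```

For real data the residuals would not vanish; they are what remains to be modeled after the trend is removed.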

Curve Fitting 

Fitting a simple function of time such as a polynomial curve (linear,
quadratic, etc.), a Gompertz curve, or a logistic curve is a well-known
method of dealing with nonseasonal data that contains a trend,
particularly yearly data. The global linear trend is the simplest type of
polynomial curve. The Gompertz curve can be written in the following
format, where $a$, $b$, and $r$ are parameters with $0 < r < 1$:

$\log X_t = a + b r^t$

It can also be written in the following alternative form:

$X_t = \alpha \exp[\beta \exp(-\gamma t)]$

This looks quite different but is actually equivalent, provided $\gamma > 0$. The
logistic curve is as follows:

$X_t = a / (1 + b e^{-ct})$

Both these curves are S-shaped and approach an asymptotic value
as $t \to \infty$, with the Gompertz curve generally converging slower than the
logistic one. Fitting the curves to data may lead to nonlinear simultaneous
equations.

For all curves of this nature, the fitted function provides a measure 
of the trend, and the residuals provide an estimate of local fluctuations 
where the residuals are the differences between the observations and the 
corresponding values of the fitted curve. 

Removing Trends from a Time Series 

Differencing a given time series until it becomes stationary is a special
type of filtering that is particularly useful for removing a trend. You will
see that this is an integral part of the Box-Jenkins procedure. For data with
a linear trend, first-order differencing is usually enough to remove the
trend.

Mathematically, it looks like this:

y(t) = a*t + c
y(t+1) = a*(t+1) + c

z(t) = y(t+1) - y(t) = a; no trend is present in z(t)

A trend can be exponential as well. In this case, you will have to do a
logarithmic transformation to convert the trend from exponential to linear.
Mathematically, it looks like this:

y(t) = a*exp(t)

z(t) = log(y(t)) = log(a) + t; z(t) is a linear function of t
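
Both cases can be checked numerically. The following is a minimal sketch using NumPy; the constants (a = 3 and c = 5 for the linear trend, a = 2 for the exponential trend) are illustrative assumptions.

```python
import numpy as np

t = np.arange(10)

# Linear trend y(t) = a*t + c with a = 3, c = 5 (illustrative values)
y_linear = 3.0 * t + 5.0
z_linear = np.diff(y_linear)          # y(t+1) - y(t), constant and equal to a

# Exponential trend y(t) = a*exp(t) with a = 2 (illustrative value)
y_exp = 2.0 * np.exp(t)
log_y = np.log(y_exp)                 # linear in t: log(a) + t
z_exp = np.diff(log_y)                # differencing the log series removes the trend

print(z_linear)
print(z_exp)
```

After the log transformation, the exponential trend behaves like a linear one, so a single round of differencing leaves a constant series.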


Analyzing a Series Containing Seasonality 

Many time series, such as airline passenger loads or weather readings, 
display variations that repeat after a specific time period. For instance, in 
India, there will always be an increase in airline passenger loads during 
the holiday of Diwali. This yearly variation is easy to understand and can 
be estimated if seasonality is of direct interest. As with trends, if you
have a series such as 1, 2, 1, 2, 1, 2, your obvious choices for the next values
of the series will be 1 and 2.

The Holt-Winters model is a popular model to realize time series with
seasonality and belongs to the family of exponential smoothing methods. The
Holt-Winters model has two variations: additive and multiplicative. In the
additive model with a single exponential smoothing time series, seasonality is
realized as follows:

X(t+1) = α * X(t) + (1 - α) * S(t-1)

In this model, every point is realized as a weighted average of the
previous point and seasonality. Applying the same relation one step back
expresses X(t+1) in terms of X(t-1), S(t-2), and the square of the smoothing
weight. In this way, the further back you go, the more the weight attached to
an older value shrinks exponentially. This is why it is known as exponential
smoothing. The starting value of St is crucial in this method. Commonly,
this value starts with a 1 or with an average of the first four observations. 
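
The recursion can be sketched in a few lines. This is the standard textbook form of single exponential smoothing, s[t] = α·x[t] + (1 − α)·s[t−1], which may differ in indexing from the formula above; the function name, the sample series, and the choice of the first observation as the starting value are assumptions made for illustration.

```python
import numpy as np

def exponential_smoothing(x, alpha, s0=None):
    """Return smoothed values s[t] = alpha*x[t] + (1 - alpha)*s[t-1]."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    # Starting value: caller-supplied, else the first observation (an assumption)
    s[0] = x[0] if s0 is None else s0
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1.0 - alpha) * s[t - 1]
    return s

series = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
smoothed = exponential_smoothing(series, alpha=0.5)
forecast = smoothed[-1]   # one-step-ahead forecast is the last smoothed value
print(smoothed, forecast)
```

A larger α makes the smoothed series follow recent observations more closely; a smaller α gives more weight to the accumulated history.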

The multiplicative seasonal model time series is as follows:

X(t+1) = (b1 + b2*t) * St + noise

Here, b1, often referred to as the permanent component, is the initial
weight of the seasonality; b2 represents the trend, which is linear in this
case.

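However, when this book was written, there was no standard implementation
of the Holt-Winters model in Python; newer releases of Statsmodels provide
one in statsmodels.tsa.holtwinters. It is also available in R (see Chapter 1
for how R's Holt-Winters model can be called from Python code).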


Removing Seasonality from a Time Series 

There are two ways of removing seasonality from a time series. 

• By filtering 

• By differencing 

By Filtering 

The series $\{x_t\}$ is converted into another called $\{y_t\}$ with the linear operation
shown here, where $\{a_r\}$ is a set of weights:

$y_t = \sum_{r=-q}^{+s} a_r x_{t+r}$

To smooth out local fluctuations and estimate the local mean, you
should clearly choose the weights so that $\sum a_r = 1$; then the operation is
often referred to as a moving average. Moving averages are often symmetric with
$s = q$ and $a_j = a_{-j}$. The simplest example of a symmetric smoothing filter is
the simple moving average, for which $a_r = 1/(2q+1)$ for $r = -q, \ldots, +q$.

The smoothed value of $x_t$ is given by the following:

$\mathrm{Sm}(x_t) = \frac{1}{2q+1} \sum_{r=-q}^{+q} x_{t+r}$

The simple moving average is useful for removing seasonal variations,
but it is unable to deal well with trends.

By Differencing 

Differencing is widely used and often works well. Seasonal differencing 
removes seasonal variation. 

Mathematically, if time series y(t) contains additive seasonality S(t)
with time period T, then S(t+T) = S(t) and:

y(t) = a*S(t) + b*t + c
y(t+T) = a*S(t+T) + b*(t+T) + c
z(t) = y(t+T) - y(t) = b*T + noise term

Similar to trends, you can convert the multiplicative seasonality to
additive by log transformation.

Now, finding time period T in a time series is the critical part. It can
be done in two ways, either by using an autocorrelation function in the
time domain or by using the Fourier transform in the frequency domain.
In both cases, you will see a spike in the plot. For autocorrelation, the
plot spike will be at lag T, whereas for the Fourier transform, the spike will be at
frequency 1/T.
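
The following sketch illustrates both ways of finding T and then applies seasonal differencing. It uses only NumPy on a synthetic series (a linear trend plus a period-12 sine wave); the data, the detrending step, and the lag search range are assumptions chosen for the illustration.

```python
import numpy as np

# Synthetic series: linear trend plus a period-12 seasonal term (assumed data)
n, T_true = 240, 12
t = np.arange(n)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / T_true)

# Remove the trend first so it does not dominate either diagnostic
detrended = y - np.polyval(np.polyfit(t, y, 1), t)

# Frequency domain: the spectrum peaks at frequency 1/T
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(n, d=1.0)
peak = np.argmax(spectrum[1:]) + 1          # skip the zero-frequency bin
T_fft = int(round(1.0 / freqs[peak]))

# Time domain: the autocorrelation peaks again at lag T
d = detrended - detrended.mean()
acf = np.correlate(d, d, mode="full")[n - 1:] / np.dot(d, d)
T_acf = int(np.argmax(acf[2:n // 2]) + 2)   # search lags 2 .. n/2 - 1

# Seasonal differencing: z(t) = y(t+T) - y(t) leaves only b*T
z = y[T_fft:] - y[:-T_fft]
print(T_fft, T_acf)
```

On noisy real data the peaks are less sharp, so it is worth checking that the two estimates agree before differencing.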


Transformation 


Up to now I have discussed the various kinds of transformation in a time 
series. The three main reasons for making a transformation are covered in 
the next sections. 

To Stabilize the Variance 

The standard way to do this is to take a logarithmic transformation of the
series; it brings closer the points in space that are widely scattered.

To Make the Seasonal Effect Additive

If the series has a trend and the size of the seasonal effect appears to
rise with the mean, then it may be advisable to modify the data so as
to make the seasonal effect constant from year to year. A seasonal effect
that is constant in this way is said to be additive. However, if the size of
the seasonal effect is directly proportional to the mean, then the seasonal
effect is said to be multiplicative, and a logarithmic transformation is
needed to make it additive again.

To Make the Data Distribution Normal

In most probability models, it is assumed that the distribution of data is
Gaussian or normal. In practice, however, there can be evidence of skewness in a
trend that causes "spikes" in the time plot that are all in the same direction.

To bring the data closer to a normal distribution, the most common
transform is to subtract the mean and then divide by the standard
deviation. I gave an example of this transformation in the RNN example in
Chapter 5; I'll give another in the final example of the current chapter. The
logic behind this transformation is that it makes the mean 0 and the standard
deviation 1, which is a characteristic of a standard normal distribution. Another
popular transformation is to use the logarithm. The major advantage of a
logarithm is that it reduces the variation, and the logarithm of log-normally
distributed data is Gaussian. Transformation may be problem-specific or
domain-specific. For instance, in a time series of an airline's passenger
load data, the series can be normalized by dividing by the number of days
in the month or by the number of holidays in a month.
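
As a minimal sketch of the two transformations discussed here, the following standardizes a small series and takes its logarithm. The sample values are assumed for illustration only.

```python
import numpy as np

# Illustrative positive, right-skewed data (assumed values)
x = np.array([12.0, 15.0, 14.0, 40.0, 18.0, 95.0, 16.0, 13.0])

# Standardize: subtract the mean, then divide by the standard deviation
mean, std = x.mean(), x.std()
z = (x - mean) / std

# Invert the transform to return forecasts to the original scale
x_back = z * std + mean

# Log transform: compresses large values and narrows the spread
log_x = np.log(x)
print(z.mean(), z.std())
```

Keeping `mean` and `std` around matters in practice, because forecasts made on the standardized series have to be mapped back with the same two numbers.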

Cyclic Variation 

In some time series, seasonality is not a constant but a stochastic variable.
That is known as cyclic variation. In this case, the periodicity first has to
be predicted and then has to be removed in the same way as done for
seasonal variation.

Irregular Fluctuations

A time series without trends and cyclic variations can be realized as a
weakly stationary time series. In the next section, you will examine various
probabilistic models to realize weakly stationary time series.


Stationary Time Series

Normally, a time series is said to be stationary if there is no systematic
change in mean and variance and if strictly periodic variations have
been done away with. In real life, there are no truly stationary time series.
Whatever data you receive, you may use transformations to bring
it nearer to a stationary series.

Stationary Process 

A time series is strictly stationary if the joint distribution of $X(t_1), \ldots, X(t_k)$ is
the same as the joint distribution of $X(t_1 + \tau), \ldots, X(t_k + \tau)$ for all $t_1, \ldots, t_k, \tau$. If $k = 1$,
strict stationarity implies that the distribution of $X(t)$ is the same for all $t$,
so provided the first two moments are finite, you have the following:

$\mu(t) = \mu$

$\sigma^2(t) = \sigma^2$

They are both constants, which do not depend on the value of t.

A weakly stationary time series is a stochastic process where the mean
is constant and autocovariance is a function of time lag.
Autocorrelation and the Correlogram


Quantities called sample autocorrelation coefficients act as an important
guide to the properties of a time series. They evaluate the correlation,
if any, between observations at different distances apart and provide
valuable descriptive information. You will see that they are also an
important tool in model building and often provide valuable clues for
a suitable probability model for a given set of data. The quantity lies in
the range [-1,1] and measures the strength of the linear association
between the two variables. It can be easily shown that the value does
not depend on the units in which the two variables are measured; if the
variables are independent, then the ideal correlation is zero.

A helpful supplement in interpreting a set of autocorrelation
coefficients is a graph called a correlogram. The correlogram may be
alternatively called the sample autocorrelation function.

Suppose a stationary stochastic process $X(t)$ has a mean $\mu$, variance $\sigma^2$,
autocovariance function (acv.f.) $\gamma(\tau)$, and autocorrelation function (ac.f.) $\rho(\tau)$. Then:

$\rho(\tau) = \gamma(\tau) / \gamma(0) = \gamma(\tau) / \sigma^2$



Estimating Autocovariance and Autocorrelation 
Functions 


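In the stochastic process, the autocovariance is the covariance of the
process with itself at pairs of time points. The sample autocovariance at
lag h is calculated as follows:

$c_h = \frac{1}{n} \sum_{t=1}^{n-|h|} (x_{t+|h|} - \bar{x})(x_t - \bar{x}), \quad -n < h < n$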
Figure 6-2 shows a sample autocorrelation distribution.


Figure 6-2. Sample autocorrelations 
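
The sample autocovariance and autocorrelation can be computed directly from their definitions. The sketch below assumes the usual estimator that divides by n rather than by n − |h|; the function names and the alternating test series are illustrative.

```python
import numpy as np

def sample_autocovariance(x, h):
    """c_h = (1/n) * sum over t of (x[t+|h|] - mean) * (x[t] - mean)."""
    x = np.asarray(x, dtype=float)
    n, h = len(x), abs(h)
    d = x - x.mean()
    return np.dot(d[h:], d[:n - h]) / n

def sample_autocorrelation(x, h):
    """r_h = c_h / c_0, which always lies in [-1, 1]."""
    return sample_autocovariance(x, h) / sample_autocovariance(x, 0)

# Alternating series: strong negative correlation at odd lags, positive at even lags
x = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
correlogram = [sample_autocorrelation(x, h) for h in range(4)]
print(correlogram)
```

Plotting `correlogram` against the lag gives the correlogram described earlier; the slow decay in magnitude comes from dividing by n while summing fewer terms at larger lags.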

Time-Series Analysis with Python 

Statsmodels is a Python package that complements SciPy for statistical
computations, including descriptive statistics and the estimation of
statistical models. Besides the early models (linear regression,
robust linear models, generalized linear models, and models for discrete
data), the latest release of Statsmodels (formerly scikits.statsmodels)
includes some basic tools and models for time-series analysis, such as
descriptive statistics, statistical tests, and several linear model classes.
The linear model classes include autoregressive (AR), autoregressive
moving-average (ARMA), and vector autoregressive models (VAR).

Let's start with a moving average.

Moving Average Process 
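Suppose that $\{Z_t\}$ is a purely random process with mean 0 and variance $\sigma_Z^2$.
Then a process $\{X_t\}$ is said to be a moving average process of order q if:

$X_t = \beta_0 Z_t + \beta_1 Z_{t-1} + \cdots + \beta_q Z_{t-q}$

Here, $\{\beta_i\}$ are constants. The Zs are usually scaled so that $\beta_0 = 1$.

$E(X_t) = 0$

$\mathrm{Var}(X_t) = \sigma_Z^2 \sum_{i=0}^{q} \beta_i^2$

The Zs are independent, so:

$\gamma(k) = \mathrm{Cov}(X_t, X_{t+k})$

$= \mathrm{Cov}(\beta_0 Z_t + \cdots + \beta_q Z_{t-q},\ \beta_0 Z_{t+k} + \cdots + \beta_q Z_{t+k-q})$

$= \begin{cases} 0 & k > q \\ \sigma_Z^2 \sum_{i=0}^{q-k} \beta_i \beta_{i+k} & k = 0, 1, \ldots, q \\ \gamma(-k) & k < 0 \end{cases}$

This follows because:

$\mathrm{Cov}(Z_s, Z_t) = \begin{cases} \sigma_Z^2 & s = t \\ 0 & s \neq t \end{cases}$

As $\gamma(k)$ is not dependent on t and the mean is constant, the process is
second-order stationary for all values of $\{\beta_i\}$. The autocorrelation
function is as follows:

$\rho(k) = \begin{cases} 1 & k = 0 \\ \sum_{i=0}^{q-k} \beta_i \beta_{i+k} \Big/ \sum_{i=0}^{q} \beta_i^2 & k = 1, \ldots, q \\ 0 & k > q \\ \rho(-k) & k < 0 \end{cases}$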

Fitting Moving Average Process 

The moving-average (MA) model is a well-known approach for realizing a
single-variable weakly stationary time series (see Figure 6-3). The
moving-average model specifies that the output variable is linearly dependent on
its own previous error terms as well as on a stochastic term. Together with
the AR model, the moving-average model is a special case and a key
component of the ARMA and ARIMA models of time series.

Figure 6-3. Example of moving average (plot of the actual series and its moving average, interval = 6)
Here's the example code for a moving average model:

import numpy as np

def running_mean(l, N):
    # Also works for the (strictly invalid) cases when N is even.
    if (N//2)*2 == N:
        N = N - 1
    front = np.zeros(N//2)
    back = np.zeros(N//2)

    # Shorter odd-length windows fill in the two edges of the series
    for i in range(1, (N//2)*2, 2):
        front[i//2] = np.convolve(l[:i], np.ones((i,))/i, mode='valid')[0]
    for i in range(1, (N//2)*2, 2):
        back[i//2] = np.convolve(l[-i:], np.ones((i,))/i, mode='valid')[0]

    # Full-length window for the interior, edges attached on both sides
    return np.concatenate([front,
                           np.convolve(l, np.ones((N,))/N, mode='valid'),
                           back[::-1]])

# l must be a one-dimensional series (list or NumPy array) longer than N
print(running_mean(l, 21))

Autoregressive Processes 

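Suppose $\{Z_t\}$ is a purely random process with mean 0 and variance $\sigma_Z^2$.
Then a process $\{X_t\}$ is said to be an autoregressive process of order p if
you have this:

$X_t = \alpha_1 X_{t-1} + \cdots + \alpha_p X_{t-p} + Z_t$, or

$X_t = \sum_{i=1}^{p} \alpha_i X_{t-i} + Z_t$

For the first-order process $X_t = \alpha X_{t-1} + Z_t$ with $|\alpha| < 1$, the
autocovariance function is given by the following:

$\gamma(k) = \alpha^k \sigma_Z^2 / (1 - \alpha^2), \quad k = 0, 1, 2, \ldots$

hence

$\rho(k) = \gamma(k) / \gamma(0) = \alpha^k$

Figure 6-4 shows a time series and its autocorrelation plot of the AR
model.

Figure 6-4. A time series and AR model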

Estimating Parameters of an AR Process 

A process is called weakly stationary if its mean is constant and the
autocovariance function depends only on time lag. No real process is exactly
weakly stationary, but the assumption is imposed on time-series data to do some
stochastic analysis. Suppose Z(t) is a weakly stationary process with mean 0
and constant variance. Then X(t) is an autoregressive process of order p if
you have the following:

X(t) = a1 * X(t-1) + a2 * X(t-2) + ... + ap * X(t-p) + Z(t), where a ∈ R and p ∈ I

Now, E[X(t)] is the expected value of X(t).

Covariance(X(t),X(t+h)) = E[(X(t) - E[X(t)]) * (X(t+h) - E[X(t+h)])]

                        = E[(X(t) - m) * (X(t+h) - m)]

If X(t) is a weakly stationary process, then:

E[X(t)] = E[X(t+h)] = m (constant)

Covariance(X(t),X(t+h)) = E[X(t) * X(t+h)] - m^2 = c(h)

Here, m is constant, and cov[X(t),X(t+h)] is a function of only h, written c(h),
for the weakly stationary process. c(h) is known as autocovariance.

Similarly, correlation(X(t),X(t+h)) = ρ(h) = r(h) = c(h) / c(0) is
known as autocorrelation.

If X(t) is a stationary process that is realized as an autoregressive
model, then:

X(t) = a1 * X(t-1) + a2 * X(t-2) + ... + ap * X(t-p) + Z(t)

For h >= 1, Z(t) is uncorrelated with X(t-h), so:

Correlation(X(t),X(t-h)) = a1 * correlation(X(t-1),X(t-h)) + ... +
ap * correlation(X(t-p),X(t-h)) + 0

As covariance(X(t),X(t+h)) is dependent only on h, and r0 = 1:

r1 = a1 * r0 + a2 * r1 + ... + ap * r(p-1)
r2 = a1 * r1 + a2 * r0 + ... + ap * r(p-2)

So, for an n-order model, you can easily generate n equations and
from there find the n coefficients by solving the system of n equations.
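
Solving this system is a small linear-algebra problem. The sketch below builds the equations r_h = a1·r_|h−1| + ... + ap·r_|h−p| for h = 1, ..., p (with r0 = 1) and solves them with NumPy; the function name and the autocorrelation values are assumptions chosen for illustration.

```python
import numpy as np

def yule_walker(r):
    """Solve for AR coefficients a1..ap given autocorrelations r = [r1, ..., rp]."""
    r = np.asarray(r, dtype=float)
    p = len(r)
    full = np.concatenate(([1.0], r))          # r0 = 1, r1, ..., rp
    # Matrix entry (i, j) is r_|i-j|, matching r_h = sum_j a_j * r_|h-j|
    R = np.array([[full[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r)

# Second-order example with illustrative autocorrelations
r1, r2 = 0.6, 0.2
a1, a2 = yule_walker([r1, r2])
print(a1, a2)
```

For p = 1 the system collapses to a1 = r1, and for p = 2 it reproduces the closed-form expressions a1 = r1(1 − r2)/(1 − r1²) and a2 = (r2 − r1²)/(1 − r1²).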
In this case, fit only the first-order and second-order autoregressive
models to the data and choose the model whose mean residual
is smaller. For that, the reduced formulae are as follows:

• First order: a1 = r1

• Second order: a1 = r1 * (1 - r2) / (1 - r1^2), a2 = (r2 - r1^2) / (1 - r1^2)

Here is some example code for an autoregressive model: 

# Series.from_csv and AR come from older pandas and statsmodels releases;
# newer releases replace them with read_csv and AutoReg
from pandas import Series
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error

series = Series.from_csv('input.csv', header=0)
D = series.values

# Hold out the last 10 observations for testing
train, test = D[1:len(D)-10], D[len(D)-10:]

model = AR(train)
model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)

predictions = model_fit.predict(start=len(train),
                                end=len(train)+len(test)-1, dynamic=False)
for t in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[t], test[t]))

error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)

pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
Mixed ARMA models are a combination of MA and AR processes. A mixed
autoregressive/moving average process containing p AR terms and q
MA terms is said to be an ARMA process of order (p,q). It is given by the
following:

$X_t = \alpha_1 X_{t-1} + \cdots + \alpha_p X_{t-p} + Z_t + \beta_1 Z_{t-1} + \cdots + \beta_q Z_{t-q}$

The following example code was taken from the Statsmodels site to
realize time-series data as an ARMA model:

r1, q1, p1 = sm.tsa.acf(resid.values.squeeze(), qstat=True)
data1 = np.c_[range(1,40), r1[1:], q1, p1]
table1 = pandas.DataFrame(data1, columns=['lag', "AC", "Q", "Prob(>Q)"])
predict_sunspots1 = arma_mod40.predict('startyear', 'endyear', dynamic=True)

Here is the simulated ARMA (4,1) model identification code:

from statsmodels.tsa.arima_process import arma_generate_sample, ArmaProcess

np.random.seed(1234)
data = np.array([1, .85, -.43, -.63, .8])
parameter = np.array([1, .41])
model = ArmaProcess(data, parameter)
model.isinvertible()
True
model.isstationary()
True
Here is how to estimate parameters of an ARMA model:

1. After specifying the order of a stationary ARMA 
process, you need to estimate the parameters. 

2. Assume the following: 

• The model order (p and q) is known. 

• The data has zero mean. 

3. If step 2 is not a reasonable assumption, you can
subtract the sample mean $\bar{Y}$ and fit a zero-mean ARMA
model, as in $\phi(B) X_t = \theta(B) a_t$ where $X_t = Y_t - \bar{Y}$. Then
use $X_t + \bar{Y}$ as the model for $Y_t$.


Integrated ARMA Models

To fit a stationary model such as the one discussed earlier, it is imperative
to remove nonstationary sources of variation. Differencing is widely used
for econometric data. If $X_t$ is replaced by $\nabla^d X_t$, then you have a model
capable of describing certain types of nonstationary series.


These are the key points in estimating the parameters of an ARIMA model:

• ARIMA models are designated by the level of
autoregression, integration, and moving averages.

• The method does not assume any particular pattern; it uses an iterative
approach of identifying a model.

• The model "fits" if residuals are generally small,
randomly distributed, and, in general, contain no
useful information.
Here is the example code for an ARIMA model:

from pandas import read_csv
from datetime import datetime
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error

def parser(p):
    # Prefix '190' so that the year field parses as four digits
    return datetime.strptime('190'+p, '%Y-%m')

series = read_csv('input.csv', header=0, parse_dates=[0],
                  index_col=0, squeeze=True, date_parser=parser)
P = series.values
size = int(len(P) * 0.66)
train, test = P[0:size], P[size:len(P)]
history = [p for p in train]
predictions = list()

# Walk-forward validation: refit on all data seen so far, forecast one step
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)

# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
The Fourier Transform


The representation of nonperiodic signals by everlasting exponential
signals can be accomplished by a simple limiting process, and I will
illustrate that nonperiodic signals can be expressed as a continuous sum
(integral) of everlasting exponential signals. Say you want to represent the
nonperiodic signal g(t). Realizing any nonperiodic signal as a periodic
signal $g_p(t)$ with an infinite time period $T_0$, you get the following:

$G(n\Delta\omega) = \lim_{T_0 \to \infty} \int_{-T_0/2}^{T_0/2} g_p(t) e^{-jn\Delta\omega t} \, dt = \int_{-\infty}^{\infty} g(t) e^{-jn\Delta\omega t} \, dt$

Hence:

$G(\omega) = \int_{-\infty}^{\infty} g(t) e^{-j\omega t} \, dt$

$G(\omega)$ is known as the Fourier transform of g(t).

Here is the relation between autocovariance and the Fourier
transform:

$\gamma(0) = \sigma_X^2 = \int_0^{\pi} dF(\omega) = F(\pi)$
An Exceptional Scenario

In the airline or hotel domain, the passenger load of month t is less
correlated with the data of months t-1 or t-2, but it is more correlated with
month t-12. For example, the passenger load in the month of Diwali (October)
is more correlated with last year's Diwali data than with the same year's
August and September data. Historically, the pick-up model is used to
predict this kind of data. The pick-up model has two variations.

In the additive pick-up model,

X(t) = X(t-1) + [X(t-12) - X(t-13)]

In the multiplicative pick-up model,

X(t) = X(t-1) * [X(t-12) / X(t-13)]

Studies have shown that for this kind of data the neural network-based 
predictor gives more accuracy than the time-series model. 
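
The two pick-up models above can be sketched in a few lines of plain Python. The function names and the fourteen months of passenger loads are assumptions made up for the illustration.

```python
def additive_pickup(x, t):
    """X(t) = X(t-1) + [X(t-12) - X(t-13)] for a monthly sequence x."""
    return x[t - 1] + (x[t - 12] - x[t - 13])

def multiplicative_pickup(x, t):
    """X(t) = X(t-1) * [X(t-12) / X(t-13)] for a monthly sequence x."""
    return x[t - 1] * (x[t - 12] / x[t - 13])

# Fourteen months of illustrative passenger loads (assumed values)
loads = [100, 110, 120, 130, 125, 115, 140, 150, 135, 160, 180, 150, 105, 121]
t = len(loads)                      # index of the month to forecast
add_forecast = additive_pickup(loads, t)
mult_forecast = multiplicative_pickup(loads, t)
print(add_forecast, mult_forecast)
```

Both models need at least 13 past observations, because they apply last year's month-over-month change to the most recent value.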

In high-frequency trading in investment banking, time-series models
are too time-consuming to capture the latest pattern of the instrument.

So, traders calculate dX/dt and d2X/dt2 on the fly, where X is the price of
the instrument. If both are positive, they blindly send an order to buy the
instrument. If both are negative, they blindly sell the instrument if they
have it in their portfolio. But if the two have opposite signs, then they do a
more detailed analysis using the time-series data.

As I stated earlier, there are many scenarios in time-series analysis 
where R is a better choice than Python. So, here is an example of time- 
series forecasting using R. The beauty of the auto.arima model is that it 
automatically finds the order, trends, and seasonality of the data and fits 
the model. In the forecast, we are printing only the mean value, but the 
model provides the upper limit and the lower limit of the prediction in
forecasting.

asm_weekwise <- read.csv("F:/souravda/New ASM Weekwise.csv", header=TRUE)
asm_weekwise$Week <- NULL

library(MASS, lib.loc="F:/souravda/lib/")
library(tseries, lib.loc="F:/souravda/lib/")
library(forecast, lib.loc="F:/souravda/lib/")

# Replace missing and nonpositive values with the overall mean
asm_weekwise[is.na(asm_weekwise)] <- 0
asm_weekwise[asm_weekwise <= 0] <- mean(as.matrix(asm_weekwise))

weekjoyforecastvalues <- data.frame("asm" = integer(), "value" = integer(),
                                    stringsAsFactors=FALSE)

for(i in 2:ncol(asm_weekwise))
{
    asmname <- names(asm_weekwise)[i]
    temparimadata <- asm_weekwise[,i]
    m <- mean(as.matrix(temparimadata))
    #print(m)
    s <- sd(temparimadata)
    #print(s)
    # Standardize the series before fitting
    temparimadata <- (temparimadata - m)
    temparimadata <- (temparimadata / s)
    temparima <- auto.arima(temparimadata, stationary = FALSE,
        seasonal = TRUE, allowdrift = TRUE, allowmean = FALSE, biasadj = FALSE)
    tempforecast <- forecast(temparima, h=12)
    #tempforecast <- (tempforecast * s)
    #print(tempforecast)
    # Undo the standardization and sum the 12 forecast values
    temp_forecasted_data <- sum(data.frame(tempforecast$mean)*s + m)
    weekjoyforecastvalues[nrow(weekjoyforecastvalues) + 1, ] <-
        c(asmname, temp_forecasted_data)
}

weekjoyforecastvalues$value <- as.integer(weekjoyforecastvalues$value)
#weekjoyforecastvalues

(sum(weekjoyforecastvalues$value) - 53782605)/53782605
# 103000000)/103000000

Missing Data 

One important aspect of time series and many other data analysis tasks is
figuring out how to deal with missing data. In the previous code, you fill in
the missing record with the average value. This is fine when the number of
missing data instances is not very high. But if it is high, then the average of
the highest and lowest values is a better alternative.
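
Both fill strategies can be sketched in Python with NumPy. The series below, with missing records marked as NaN, is an assumed example rather than data from the R code.

```python
import numpy as np

# Weekly values with missing records marked as NaN (assumed data)
x = np.array([10.0, np.nan, 14.0, 30.0, np.nan, 12.0])

observed = x[~np.isnan(x)]
mean_fill = observed.mean()                                # average of observed values
midrange_fill = (observed.max() + observed.min()) / 2.0    # average of highest and lowest

# Replace only the missing positions, leaving observed values untouched
filled_mean = np.where(np.isnan(x), mean_fill, x)
filled_midrange = np.where(np.isnan(x), midrange_fill, x)
print(filled_mean, filled_midrange)
```

The two fills differ whenever the observed values are skewed, as they are here because of the single large value.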
Analytics at Scale

In recent decades, a revolutionary change has taken place in the field of 
analytics technology because of big data. Data is being collected from a 
variety of sources, so technology has been developed to analyze this data 
in a distributed environment, even in real time. 


Hadoop 

The revolution started with the development of the Hadoop framework,
which has two major components, namely, MapReduce programming and
the HDFS file system.

MapReduce Programming 

MapReduce is a programming style inspired by functional programming 
to deal with large amounts of data. The programmer can process big data 
using MapReduce code without knowing the internals of the distributed 
environment. Before MapReduce, frameworks like Condor did parallel
computing on distributed data. But the main advantage of MapReduce
is that it is RPC based. The data does not move; on the contrary, the code
jumps to different machines to process the data. In the case of big data, this is
a huge savings of network bandwidth as well as computational time.


© Sayan Mukhopadhyay 2018 

S. Mukhopadhyay, Advanced Data Analytics Using Python, 
https://doi.org/10.1007/978-1-4842-3450-1_7




A MapReduce program has two major components: the mapper and
the reducer. In the mapper, the input is split into small units. Generally,
each line of the input file becomes an input for one map job. The mapper
processes the input and emits a key-value pair to the reducer. The reducer
receives all the values for a particular key as input and processes the data
for final output.

The following pseudocode is an example of counting the frequency of 
words in a document: 

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
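
To make the data flow concrete, here is a minimal single-process sketch of the same word count in Python. It only imitates the map, shuffle, and reduce phases in memory; the function names and the two sample documents are assumptions, and no Hadoop API is involved.

```python
from collections import defaultdict

def mapper(name, contents):
    """Emit an intermediate (word, 1) pair for each word in the document."""
    return [(word, 1) for word in contents.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Sum the list of counts for one word."""
    return word, sum(counts)

documents = {"doc1": "to be or not to be", "doc2": "to do"}
intermediate = []
for name, contents in documents.items():
    intermediate.extend(mapper(name, contents))
word_counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the mapper calls run on different machines and the shuffle moves data over the network, but the contract between the three phases is the same.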

Partitioning Function 

Sometimes it is required to send a particular data set to a particular reduce
job. The partitioning function serves this purpose. For example, in the
previous MapReduce example, say the user wants the output to be stored
in sorted order. Then the user sets the number of reduce jobs to 32 for 32
alphabets, and in the partitioner returns 1 for the key starting with a, 2
for b, and so on. Then all the words that start with the same letter go
to the same reduce job. The output will be stored in the same output file,
and because MapReduce assures that the intermediate key-value pairs are
processed in increasing key order, within a given partition, the output will
be stored in sorted order. 
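
The idea can be sketched in Python as follows. This is a hypothetical partitioner, not Hadoop's Partitioner interface; it assumes the 26 lowercase ASCII letters rather than the 32 mentioned above, and it sends any other key to partition 0.

```python
import string
from collections import defaultdict

def partition(key):
    """Return 1 for keys starting with 'a', 2 for 'b', ..., 26 for 'z'; 0 otherwise."""
    first = key[:1].lower()
    if first and first in string.ascii_lowercase:
        return string.ascii_lowercase.index(first) + 1
    return 0

words = ["banana", "apple", "cherry", "avocado", "blueberry"]
partitions = defaultdict(list)
for w in words:
    partitions[partition(w)].append(w)

# Each reducer sees its keys in increasing order; concatenating the
# partitions by partition number then gives a globally sorted output.
output = [w for p in sorted(partitions) for w in sorted(partitions[p])]
print(output)
```

Because the partition number increases with the first letter, reading the reducers' output files in order yields the whole vocabulary in sorted order.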

Combiner Function 

The combiner is a facility in MapReduce where partial aggregation is
done in the map phase. Not only does it increase the performance, but
sometimes it is essential to use if the data set is so huge that the reducer is
throwing a stack overflow exception. Usually the reducer and combiner
logic are the same, but they might need to differ depending on how
MapReduce deals with the output.

To implement this word count example, we will follow a particular
design pattern. There will be a root RootBDAS (BDAS stands for Big Data
Analytic System) class that has two abstract methods: a mapper task and a
reducer task. All child classes implement these mapper and reducer tasks.
The main class will create an instance of the child class using reflection,
and the MapReduce map function calls the mapper task of the instance
and the reduce function calls the reducer task. The major advantages of this
pattern are that you can do unit testing of the MapReduce functionality
and that it is adaptive. Any new child class addition does not require any
changes in the main class or unit testing. You just have to change the
configuration. Some code may need to implement combiner or partitioner
logic. Such classes have to inherit the ICombiner or IPartitioner interface.
Figure 7-1 shows a class diagram of the system.



Figure 7-1. The class diagram 


Here is the RootBDAS class: 

import java.util.ArrayList;
import java.util.HashMap;

/**
 * @author SayanM
 */
public abstract class RootBDAS {
    abstract HashMap<String, ArrayList<String>> mapper_task(String line);
    abstract HashMap<String, ArrayList<String>> reducer_task(String key,
            ArrayList<String> values);
}


Here is the child class: 

import java.util.ArrayList;
import java.util.HashMap;

/**
 * @author SayanM
 */
public final class WordCounterBDAS extends RootBDAS {

    @Override
    HashMap<String, ArrayList<String>> mapper_task(String line) {
        // Emit a "1" for every occurrence of every word in the line.
        String[] words = line.split(" ");
        HashMap<String, ArrayList<String>> result = new HashMap<String, ArrayList<String>>();
        for (String w : words) {
            if (result.containsKey(w)) {
                ArrayList<String> vals = result.get(w);
                vals.add("1");
                result.put(w, vals);
            } else {
                ArrayList<String> vals = new ArrayList<String>();
                vals.add("1");
                result.put(w, vals);
            }
        }
        return result;
    }

    @Override
    HashMap<String, ArrayList<String>> reducer_task(String key, ArrayList<String> values) {
        // The count of a word is the number of "1" values collected for it.
        HashMap<String, ArrayList<String>> result = new HashMap<String, ArrayList<String>>();
        ArrayList<String> tempres = new ArrayList<String>();
        tempres.add(values.size() + "");
        result.put(key, tempres);
        return result;
    }

}

Here is the MainBDAS class: 

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @author SayanM
 */
public class MainBDAS {

    public static class MapperBDAS extends Mapper<LongWritable, Text, Text, Text> {

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String classname = context.getConfiguration().get("classname");
            try {
                // Create the configured child class by reflection.
                RootBDAS instance = (RootBDAS) Class.forName(classname)
                        .getConstructor().newInstance();
                String line = value.toString();
                HashMap<String, ArrayList<String>> result = instance.mapper_task(line);
                for (String k : result.keySet()) {
                    for (String v : result.get(k)) {
                        context.write(new Text(k), new Text(v));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static class ReducerBDAS extends Reducer<Text, Text, Text, Text> {

        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String classname = context.getConfiguration().get("classname");
            try {
                RootBDAS instance = (RootBDAS) Class.forName(classname)
                        .getConstructor().newInstance();
                // Collect all values of this key before handing them to the reducer task.
                ArrayList<String> vals = new ArrayList<String>();
                for (Text v : values) {
                    vals.add(v.toString());
                }
                HashMap<String, ArrayList<String>> result =
                        instance.reducer_task(key.toString(), vals);
                for (String k : result.keySet()) {
                    for (String v : result.get(k)) {
                        context.write(new Text(k), new Text(v));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // The child class to run is read from the configuration file.
        String classname = Utility.getClassName(Utility.configpath);

        Configuration con = new Configuration();
        con.set("classname", classname);

        Job job = new Job(con);
        job.setJarByClass(MainBDAS.class);
        job.setJobName("MapReduceBDAS");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperBDAS.class);
        job.setReducerClass(ReducerBDAS.class);

        System.out.println(job.waitForCompletion(true));
    }
}

To test the example, you can use this unit testing class: 

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.HashMap;

import org.junit.Test;

public class testBDAS {

    @Test
    public void testMapper() throws Exception {
        String classname = Utility.getClassName(Utility.testconfigpath);
        RootBDAS instance = (RootBDAS) Class.forName(classname)
                .getConstructor().newInstance();
        String line = Utility.getMapperInput(Utility.testconfigpath);
        HashMap<String, ArrayList<String>> actualresult = instance.mapper_task(line);
        HashMap<String, ArrayList<String>> expectedresult =
                Utility.getMapOutput(Utility.testconfigpath);

        // Every key and value the mapper produced must appear in the expected output.
        for (String key : actualresult.keySet()) {
            boolean haskey = expectedresult.containsKey(key);
            assertEquals(true, haskey);
            ArrayList<String> actvals = actualresult.get(key);
            for (String v : actvals) {
                boolean hasval = expectedresult.get(key).contains(v);
                assertEquals(true, hasval);
            }
        }
    }

    @Test
    public void testReducer() {
        fail();
    }
}

Finally, here are the interfaces: 

import java.util.ArrayList;
import java.util.HashMap;

public interface ICombiner {

    HashMap<String, ArrayList<String>> combiner_task(String key, ArrayList<String> values);

}

public interface IPartitioner {

    public int partitioner_task(String line);

}

HDFS File System 

Alongside MapReduce, HDFS is the second component of the 
Hadoop framework. It is designed to deal with big data in a distributed 
environment built from general-purpose, low-cost hardware. HDFS is modeled 
on the Unix POSIX file system with some modifications, with the goal of 
dealing with streaming data. 

The Hadoop cluster consists of two types of host: the name node and 
the data node. The name node stores the metadata, controls execution, 
and acts like the master of the cluster. The data node does the actual 
execution; it acts like a worker and performs instructions sent by the name 
node. 

MapReduce Design Pattern 

MapReduce is a programming model for processing data that resides on 
hundreds of computers. Some design patterns are common 
in MapReduce programming. 

Summarization Pattern 

In summarization, the reducer creates the summary for each key (see Figure 7-2). 
The partitioner can be used if you want to sort the data or for any other 
purpose. The word count is an example of the summarization pattern. This 
pattern can be used to find the minimum, maximum, and count of data or 
to find the average, median, and standard deviation. 
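
As a rough illustration of the summarization pattern, the following Python sketch groups values by key and lets a reducer compute the statistics named above. The record layout (a key paired with a numeric reading) and the function names are assumptions made for this example; the dictionary used for grouping only imitates the shuffle phase on one machine.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def mapper(record):
    """Emit (key, value); records are hypothetical (sensor, reading) pairs."""
    sensor, reading = record
    yield sensor, float(reading)


def reducer(key, values):
    """Summarize all values that share one key."""
    return key, {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),  # population standard deviation
    }


def summarize(records):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())


summary = summarize([("a", 1), ("a", 3), ("b", 10), ("a", 5)])
print(summary["a"])
```

Count, minimum, and maximum can also be combined partially on the map side; the median needs all values of a key in one place, which is why this sketch leaves every statistic to the reducer.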
Figure 7-2. Details of the summarization pattern 

Filtering Pattern 

In MapReduce, filtering is done in a divide-and-conquer way (Figure 7-3). 
Each mapper job filters a subset of data, and the reducer aggregates the 
filtered subset and produces the final output. Generating the top N records, 
searching data, and sampling data are the common use cases of the 
filtering pattern. 
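
The top N use case can be sketched as follows. This is a minimal single-machine illustration, assuming the input has already been divided into splits; the function names are hypothetical. Each mapper keeps only the N largest records of its own split, and the reducer merges those candidates.

```python
import heapq


def mapper_top_n(split, n):
    """Each mapper keeps only the n largest records of its own split."""
    return heapq.nlargest(n, split)


def reducer_top_n(partial_results, n):
    """The reducer merges the per-mapper candidates into the final top n."""
    candidates = [x for part in partial_results for x in part]
    return heapq.nlargest(n, candidates)


splits = [[5, 1, 9], [7, 3], [8, 2, 6]]  # hypothetical input splits
partials = [mapper_top_n(s, 2) for s in splits]
top2 = reducer_top_n(partials, 2)
print(top2)
```

The reducer sees at most N records per mapper instead of the whole data set, which is what makes the divide-and-conquer approach cheap.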



Figure 7-3. Details of the filtering pattern 



CHAPTER 7 ANALYTICS AT SCALE 


Join Patterns 

In MapReduce, joining (Figure 7-4) can be done on the map side or the 
reduce side. For a map-side join, the data sets to be joined should 
exist in the same cluster; otherwise, a reduce-side join is required. The 
join can be an outer join, inner join, or anti-join. 
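
To make the join types concrete, here is a small Python sketch of a reduce-side join on one machine. It is an illustration only: the function name, the "L"/"R" source tags, and the sample records are assumptions, and only the left variant of the outer join is shown.

```python
from collections import defaultdict


def reduce_side_join(left, right, how="inner"):
    """Join two lists of (key, value) records the way a reducer would.

    how: "inner", "left_outer", or "anti" (left records without a match).
    """
    grouped = defaultdict(lambda: {"L": [], "R": []})
    # Map phase: tag each record with its source so the reducer can tell them apart.
    for key, value in left:
        grouped[key]["L"].append(value)
    for key, value in right:
        grouped[key]["R"].append(value)

    joined = []
    # Reduce phase: all records that share a key arrive together.
    for key in sorted(grouped):
        lefts, rights = grouped[key]["L"], grouped[key]["R"]
        if how == "anti":
            if not rights:
                joined.extend((key, lv, None) for lv in lefts)
        elif rights:
            joined.extend((key, lv, rv) for lv in lefts for rv in rights)
        elif how == "left_outer":
            joined.extend((key, lv, None) for lv in lefts)
    return joined


transactions = [("u1", "book"), ("u2", "pen"), ("u3", "lamp")]
users = [("u1", "Asha"), ("u2", "Ben")]
inner = reduce_side_join(transactions, users, "inner")
outer = reduce_side_join(transactions, users, "left_outer")
anti = reduce_side_join(transactions, users, "anti")
print(inner, outer, anti, sep="\n")
```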



Figure 7-4. Details of the join pattern 


The following code is an example of the reduce-side join: 

package MapreduceJoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

@SuppressWarnings("deprecation")
public class MapreduceJoin {

    ////////////////////////////////////////////////////////
    // Reducer: pairs every transaction record with the secondary record of the same key.
    @SuppressWarnings("deprecation")
    public static class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            ArrayList<String> translist = new ArrayList<String>();
            String secondvalue = "";
            while (values.hasNext()) {
                String currValue = values.next().toString().trim();
                if (currValue.contains("trans:")) {
                    String[] temp = currValue.split("trans:");
                    if (temp.length > 1)
                        translist.add(temp[1]);
                }
                if (currValue.contains("sec:")) {
                    String[] temp = currValue.split("sec:");
                    if (temp.length > 1)
                        secondvalue = temp[1];
                }
            }
            for (String trans : translist) {
                output.collect(key, new Text(trans + '\t' + secondvalue));
            }
        }
    }

    ////////////////////////////////////////////////////////
    // Mapper for the transaction data set: tags each line with "trans:".
    @SuppressWarnings("deprecation")
    public static class TransactionMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        int index1 = 0;

        public void configure(JobConf job) {
            index1 = Integer.parseInt(job.get("index1"));
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            String splitarray[] = line.split("\t");
            String id = splitarray[index1].trim();
            String ids = "trans:" + line;
            output.collect(new Text(id), new Text(ids));
        }
    }

    ////////////////////////////////////////////////////////
    // Mapper for the secondary data set: tags each line with "sec:".
    @SuppressWarnings("deprecation")
    public static class SecondaryMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        int index2 = 0;

        public void configure(JobConf job) {
            index2 = Integer.parseInt(job.get("index2"));
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            String splitarray[] = line.split("\t");
            String id = splitarray[index2].trim();
            String ids = "sec:" + line;
            output.collect(new Text(id), new Text(ids));
        }
    }

    ////////////////////////////////////////////////////////
    @SuppressWarnings({ "deprecation", "rawtypes", "unchecked" })
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        JobConf conf = new JobConf();
        // args[3] and args[4] are the join-key column indexes of the two inputs.
        conf.set("index1", args[3]);
        conf.set("index2", args[4]);
        conf.setReducerClass(JoinReducer.class);
        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class,
                (Class<? extends org.apache.hadoop.mapred.Mapper>) TransactionMapper.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class,
                (Class<? extends org.apache.hadoop.mapred.Mapper>) SecondaryMapper.class);
        Job job = new Job(conf);
        job.setJarByClass(MapreduceJoin.class);
        job.setJobName("MapReduceJoin");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.out.println(job.waitForCompletion(true));
    }
}

Spark 

After Hadoop, Spark is the next and latest revolution in big data technology. 
The major advantage of Spark is that it gives a unified interface to the entire 
big data stack. Previously, if you needed a SQL-like interface for big data, 
you would use Hive. If you needed real-time data processing, you would 
use Storm. If you wanted to build a machine learning model, you would use 
Mahout. Spark brings all these facilities under one umbrella. In addition, it 
enables in-memory computation of big data, which makes the processing 
very fast. Figure 7-5 describes all the components of Spark. 

Figure 7-5. The components of Spark 

Spark Core is the fundamental component of Spark. It can run on 
top of Hadoop or stand-alone. It abstracts the data set as a resilient 
distributed data set (RDD). An RDD is a collection of read-only objects. 
Because it is read only, there will not be any synchronization problems 
when it is shared with multiple parallel operations. Operations on 
an RDD are lazy. There are two types of operations on an RDD: 
transformations and actions. A transformation does not execute 
anything on the data set; Spark only stores the sequence of operations as 
a directed acyclic graph called a lineage. When an action is called, 
the actual execution takes place. After the first execution, the result can be 
cached in memory. So, when a new execution is called, Spark 
traverses the lineage graph and makes maximum reuse of the previous 
computation, so the computation for the new operation is kept to a 
minimum. This makes data processing very fast and also makes the data 
fault tolerant. If any node fails, Spark looks at the lineage graph for the 
data in that node and easily reproduces it. 
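
The difference between a transformation and an action can be imitated in plain Python. The ToyRDD class below is a teaching sketch, not the Spark API: its lineage is a simple chain rather than a general directed acyclic graph, and the cache is filled automatically by the first action.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only, lazy, and rebuilt from its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # read-only input data
        self._lineage = tuple(lineage)  # recorded transformations, in order
        self._cache = None              # filled by the first action

    # Transformations: record the step, compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    # Action: replay the lineage and return real values.
    def collect(self):
        if self._cache is None:
            data = list(self._source)
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return list(self._cache)


calls = []


def square(x):
    calls.append(x)  # lets us observe when work really happens
    return x * x


rdd = ToyRDD(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
first = rdd.collect()             # the action triggers the computation
second = rdd.collect()            # served from the cache
print(calls_before_action, first, len(calls))
```

Because each object keeps its source and its lineage, a lost result can always be recomputed by replaying the recorded steps, which is the idea behind the fault tolerance described above.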

One limitation of the Hadoop framework is that it does not have any 
message-passing interface for parallel computation. But there are several 
use cases where parallel jobs need to talk with each other. Spark achieves 
this using two kinds of shared variables: the broadcast variable 
and the accumulator. When one job needs to send a message to all other 
jobs, it uses a broadcast variable, and when multiple jobs want to 
aggregate their results in one place, they use an accumulator. An RDD splits its 
data set into units called partitions. Spark provides an interface to specify 
the partitioning of the data, which is very effective for later operations 
like join or find. The user can also specify the storage type of a partition. 
Spark has programming interfaces in Python, Java, and Scala. The 
following code is an example of a word count program in Spark: 

import org.apache.spark.{SparkConf, SparkContext}

// Create a Spark config object.
val conf = new SparkConf().setAppName("wiki_test")
// Create a Spark context.
val sc = new SparkContext(conf)
// Read the files in "somedir" into an RDD of lines.
val data = sc.textFile("/path/to/somedir")
// Split each line into a list of tokens (words).
val tokens = data.flatMap(_.split(" "))
// Add a count of one to each token, then sum the counts per word type.
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)
// Get the top 10 words. Swap word and count to sort by count.
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10)
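
The two kinds of shared variables described earlier, the broadcast variable and the accumulator, can be imitated with threads in plain Python. This is a sketch under stated assumptions and not the Spark API: the Accumulator class, the stop-word set, and the partitions are invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Accumulator:
    """Toy accumulator: tasks may only add; the driver reads the total."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


# "Broadcast": one read-only lookup table shared by every task.
stop_words = frozenset({"the", "a", "of"})
skipped = Accumulator()


def count_partition(partition):
    """Count non-stop words in one partition; report skipped words centrally."""
    kept = 0
    for word in partition:
        if word in stop_words:
            skipped.add(1)
        else:
            kept += 1
    return kept


partitions = [["the", "rise", "of", "spark"], ["a", "word", "count"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    kept_per_partition = list(pool.map(count_partition, partitions))
print(kept_per_partition, skipped.value)
```

The broadcast value flows one way, from the driver to every task, while the accumulator flows the other way, from the tasks back to the driver.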

On top of Spark Core, Spark provides the following: 

• Spark SQL, which is a SQL interface through the 
command line or a database connector interface. It 
also provides a SQL interface for the Spark data frame 
object. 

• Spark Streaming, which enables you to process 
streaming data in real time. 

• MLlib, a machine learning library to build analytical 
models on Spark data. 

• GraphX, a distributed graph processing framework. 


Analytics in the Cloud 

Like many other fields, analytics is being impacted by the cloud. It is 
affected in two ways. First, big cloud providers are continuously releasing 
machine learning APIs, so a developer can easily write a machine 
learning application without worrying about the underlying algorithm. 
For example, Google provides APIs for computer vision, natural language, 
speech processing, and many more. A user can write code that 
gives the sentiment of an image of a face or of a voice in two or three lines of 
code. 

The second aspect of the cloud is in the data engineering part. 
In Chapter 1, I gave an example of how to expose a model as a high- 
performance REST API using Falcon. Now if a million users are going to 
use it and if the load varies a lot, then autoscaling is a required feature 
of this application. If you deploy the application in Google App Engine 
or AWS Lambda, you can achieve the autoscale feature in 15 minutes. 
Once the application is autoscaled, you need to think about the database. 
DynamoDB from Amazon and Cloud Datastore by Google are autoscaled 
databases in the cloud. If you use one of them, your application is now 
high performance and autoscaled, but people around the globe will access it, 
so the geographical distance will create extra latency or a negative impact 
on performance. You also have to make sure that your application is always 
available. So, you need to deploy your application in three regions: 
Europe, Asia, and the United States (you can choose more regions if your 
budget permits). If you use an elastic load balancer with a geobalancing 
routing rule, which routes the traffic from a region to the app engine of 
that region, then it will be available across the globe. In geobalancing, 
you can specify a secondary app engine for each rule, which makes 
your application highly available. If a primary app engine is down, the 
secondary app engine will take over. 
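
The routing rule with a secondary app engine can be sketched as a small lookup with failover. Everything named here is hypothetical (the region names, the engine names, and the health set); a real load balancer is configured through the cloud provider rather than written by hand.

```python
# Hypothetical routing table: region -> (primary, secondary) app engine.
ROUTES = {
    "europe": ("app-eu", "app-us"),
    "asia": ("app-asia", "app-eu"),
    "us": ("app-us", "app-eu"),
}


def route(region, healthy, default_region="us"):
    """Return the engine that should serve a request from `region`.

    `healthy` is the set of engines currently passing health checks.
    """
    primary, secondary = ROUTES.get(region, ROUTES[default_region])
    if primary in healthy:
        return primary
    if secondary in healthy:
        return secondary
    raise RuntimeError("no healthy app engine for region " + region)


all_up = {"app-eu", "app-asia", "app-us"}
normal = route("asia", all_up)                   # served by the local engine
failover = route("asia", all_up - {"app-asia"})  # primary is down
print(normal, failover)
```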

Figure 7-6 describes this system. 

Figure 7-6. The system 

In Chapter 1, I showed some example code for publishing a deep 
learning model as a REST API. The following code is the implementation of 
the same logic in a cloud environment where the storage is replaced 
by Google Cloud Datastore: 

import datetime as dt
import ipaddress
import json
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

import falcon
import numpy as np
import pygeoip
from falcon_cors import CORS
from google.cloud import datastore


def logit(x):
    return (np.exp(x) / (1 + np.exp(x)))


def is_visible(client_size, ad_position):
    # The ad counts as visible when its y position lies within the client height.
    y = height = 0
    try:
        height = int(client_size.split(',')[1])
        y = int(ad_position.split(',')[1])
    except Exception:
        pass
    if y < height:
        return "1"
    else:
        return "0"

class Predictor(object):

    def __init__(self, domain, is_big):
        self.Client = datastore.Client('sulvo-east')
        self.ctr = 'ctr_' + domain
        self.ip = "ip_" + domain
        self.scores = "score_num_" + domain
        self.probabilities = "probability_num_" + domain
        if is_big:
            self.is_big = "is_big_num_" + domain
            self.scores_big = "score_big_num_" + domain
            self.probabilities_big = "probability_big_num_" + domain
        self.gi = pygeoip.GeoIP('GeoIP.dat')
        self.big = is_big
        self.domain = domain

    def get_hour(self, timestamp):
        # Timestamps arrive in milliseconds.
        return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour

    def fetch_score(self, featurename, featurevalue, kind):
        pred = 0
        try:
            # The "_" separator is an assumption; it is missing in the printed listing.
            key = self.Client.key(kind, featurename + "_" + featurevalue)
            res = self.Client.get(key)
            if res is not None:
                pred = res['score']
        except Exception:
            pass
        return pred

    def get_score(self, featurename, featurevalue):
        # Fetch all scores of one feature from the data store in parallel.
        with ThreadPoolExecutor(max_workers=5) as pool:
            future_score = pool.submit(self.fetch_score, featurename,
                                       featurevalue, self.scores)
            future_prob = pool.submit(self.fetch_score, featurename,
                                      featurevalue, self.probabilities)
            if self.big:
                future_howbig = pool.submit(self.fetch_score, featurename,
                                            featurevalue, self.is_big)
                future_predbig = pool.submit(self.fetch_score, featurename,
                                             featurevalue, self.scores_big)
                future_probbig = pool.submit(self.fetch_score, featurename,
                                             featurevalue, self.probabilities_big)
            pred = future_score.result()
            prob = future_prob.result()
            if not self.big:
                return pred, prob
            howbig = future_howbig.result()
            pred_big = future_predbig.result()
            prob_big = future_probbig.result()
            return howbig, pred, prob, pred_big, prob_big

    def get_value(self, f, value):
        if f == 'visible':
            fields = value.split("_")
            value = is_visible(fields[0], fields[1])
        if f == 'ip':
            ip = str(ipaddress.IPv4Address(ipaddress.ip_address(value)))
            geo = self.gi.country_name_by_addr(ip)
            if self.big:
                howbig1, pred1, prob1, pred_big1, prob_big1 = self.get_score('geo', geo)
            else:
                pred1, prob1 = self.get_score('geo', geo)
            freq = '1'
            key = self.Client.key(self.ip, ip)
            res = self.Client.get(key)
            if res is not None:
                freq = res['ip']
            if self.big:
                howbig2, pred2, prob2, pred_big2, prob_big2 = self.get_score('frequency', freq)
            else:
                pred2, prob2 = self.get_score('frequency', freq)
            if self.big:
                return ((howbig1 + howbig2), (pred1 + pred2), (prob1 + prob2),
                        (pred_big1 + pred_big2), (prob_big1 + prob_big2))
            else:
                return (pred1 + pred2), (prob1 + prob2)
        if f == 'root':
            try:
                key = self.Client.key('root', value)
                res = self.Client.get(key)
                if res is not None:
                    ctr = res['ctr']
                    avt = res['avt']
                    avv = res['avv']
                    if self.big:
                        (howbig1, pred1, prob1, pred_big1, prob_big1) = self.get_score('ctr', str(ctr))
                        (howbig2, pred2, prob2, pred_big2, prob_big2) = self.get_score('avt', str(avt))
                        (howbig3, pred3, prob3, pred_big3, prob_big3) = self.get_score('avv', str(avv))
                        (howbig4, pred4, prob4, pred_big4, prob_big4) = self.get_score(f, value)
                    else:
                        (pred1, prob1) = self.get_score('ctr', str(ctr))
                        (pred2, prob2) = self.get_score('avt', str(avt))
                        (pred3, prob3) = self.get_score('avv', str(avv))
                        (pred4, prob4) = self.get_score(f, value)
                    if self.big:
                        return ((howbig1 + howbig2 + howbig3 + howbig4),
                                (pred1 + pred2 + pred3 + pred4),
                                (prob1 + prob2 + prob3 + prob4),
                                (pred_big1 + pred_big2 + pred_big3 + pred_big4),
                                (prob_big1 + prob_big2 + prob_big3 + prob_big4))
                    else:
                        return ((pred1 + pred2 + pred3 + pred4),
                                (prob1 + prob2 + prob3 + prob4))
            except Exception:
                # Return as many zeros as the caller expects to unpack.
                return (0, 0, 0, 0, 0) if self.big else (0, 0)
        if f == 'client_time':
            value = str(self.get_hour(int(value)))
        return self.get_score(f, value)

    def get_multiplier(self):
        key = self.Client.key('multiplier_all_num', self.domain)
        res = self.Client.get(key)
        high = res['high']
        low = res['low']
        if self.big:
            key = self.Client.key('multiplier_all_num', self.domain + "_big")
            res = self.Client.get(key)
            high_big = res['high']
            low_big = res['low']
            return high, low, high_big, low_big
        return high, low

    def on_post(self, req, resp):
        if True:
            input_json = json.loads(req.stream.read().decode('utf-8'))
            input_json['visible'] = input_json['client_size'] + "_" + input_json['ad_position']
            del input_json['client_size']
            del input_json['ad_position']
            howbig = 0
            pred = 0
            prob = 0
            pred_big = 0
            prob_big = 0
            # Fetch the multiplier in the background while the features are scored.
            worker = ThreadPoolExecutor(max_workers=1)
            thread = worker.submit(self.get_multiplier)
            with ThreadPoolExecutor(max_workers=8) as pool:
                future_array = {pool.submit(self.get_value, f, input_json[f]): f
                                for f in input_json}
                for future in as_completed(future_array):
                    if self.big:
                        howbig1, pred1, prob1, pred_big1, prob_big1 = future.result()
                        pred = pred + pred1
                        pred_big = pred_big + pred_big1
                        prob = prob + prob1
                        prob_big = prob_big + prob_big1
                        howbig = howbig + howbig1
                    else:
                        pred1, prob1 = future.result()
                        pred = pred + pred1
                        prob = prob + prob1
            if self.big:
                if howbig > .65:
                    pred, prob = pred_big, prob_big
            resp.status = falcon.HTTP_200
            res = math.exp(pred) - 1
            if res < 0.1:
                res = 0.1
            if prob < 0.1:
                prob = 0.1
            if prob > 0.9:
                prob = 0.9
            if self.big:
                high, low, high_big, low_big = thread.result()
                if howbig > 0.6:
                    high = high_big
                    low = low_big
            else:
                high, low = thread.result()
            multiplier = low + (high - low) * prob
            res = multiplier * res
            resp.body = str(res)
        #except Exception as e:
        #    print(str(e))
        #    resp.status = falcon.HTTP_200
        #    resp.body = str("0.1")

cors = CORS(allow_all_origins=True, allow_all_methods=True, allow_all_headers=True)
wsgi_app = api = falcon.API(middleware=[cors.middleware])

# Register one Predictor per publisher domain listed in the file.
f = open('publishers2.list_test')
for line in f:
    if "#" not in line:
        fields = line.strip().split('\t')
        domain = fields[0].strip()
        big = (fields[1].strip() == '1')
        p = Predictor(domain, big)
        url = '/predict/' + domain
        api.add_route(url, p)
f.close()

You can deploy this application in the Google App Engine with the 
following: 

gcloud app deploy --project <project id> --version <version no> 


Internet of Things 

The IoT is simply the network of interconnected things/devices embedded 
with sensors, software, network connectivity, and the necessary electronics 
that enable them to collect and exchange data, making them responsive. 
The field is emerging with the rise of technologies such as big data, real- 
time analytics frameworks, mobile communication, and intelligent 
programmable devices. In the IoT, you can analyze data on the 
server side using the techniques shown throughout the book; you can also 
put logic on the device side using the Raspberry Pi, a small single-board 
computer that can run Python. 